An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures


Wangyuan Zhang, Xin Fu, Tao Li and José Fortes
Department of Electrical and Computer Engineering, University of Florida

Abstract
Semiconductor transient faults (i.e. soft errors) have become an increasingly important threat to microprocessor reliability. Simultaneous multithreaded (SMT) architectures exploit thread-level parallelism to improve overall processor throughput. A great amount of research has been conducted in the past to investigate performance and power issues of SMT architectures. Nevertheless, the effect of multithreaded execution on a microarchitecture's vulnerability to soft errors remains largely unexplored. To address this issue, we have developed a microarchitecture-level soft error vulnerability analysis framework for SMT architectures. Using a mixed set of SPEC CPU 2000 benchmarks, we quantify the impact of multithreading on a wide range of microarchitecture structures. We examine how the baseline SMT microarchitecture reliability profile varies with workload behavior, the number of threads and fetch policies. Our experimental results show that the overall vulnerability rises in multithreading architectures, while each individual thread shows less vulnerability. When both performance and reliability are considered, SMT outperforms superscalar architectures. SMT reliability and its tradeoff with performance vary across different fetch policies. With a detailed analysis of the experimental results, we point out a set of potential opportunities to reduce SMT microarchitecture vulnerability, which can serve as guidance for developing thread-aware reliability optimization techniques in the near future. To our knowledge, this paper presents the first effort to characterize microarchitecture vulnerability to soft errors on SMT processors.

1. Introduction
Semiconductor transient faults (i.e.
soft errors) have become an increasingly important threat to microprocessor reliability. Transient faults, also known as soft errors, are caused by cosmic rays or substrate alpha particles that can potentially corrupt program data. With the advance of VLSI technologies, the next generation of microprocessors is projected to be more susceptible to soft error strikes due to the continuously reduced feature size and supply voltage, and increasing clock frequency and on-chip density [1]. The rapidly increasing soft error rates and the advent of billion transistor chips suggest that it will be infeasible to protect every transistor from soft error strikes. At the microarchitecture level, a significant fraction of soft errors can be efficiently masked. Motivated by this observation, a growing number of studies [1, 2, 3, 4, 5, 6, 7, 8] have focused on characterizing microarchitecture soft error behavior. These studies, however, exclusively focus on single thread execution environments. Due to the diminishing instruction level parallelism (ILP) performance gains available on wider issue superscalar processors, simultaneous multithreaded (SMT) architectures [9] have been proposed and used in commercial processors [10, 11] to exploit thread-level parallelism. The SMT architectures improve the performance of a superscalar processor by dynamically sharing pipeline and microarchitecture resources among multiple, concurrently running threads. In the past, the performance and power issues of SMT architectures have been extensively studied [12, 13, 14, 15, 16]. As soft errors continue to become an increasing threat to hardware reliability, it is important to characterize and understand the impact of simultaneous multithreading on processor dependability. The effects of multithreading on reliability can be two-fold. 
Compared with a superscalar processor on which threads are executed sequentially, a SMT processor can concurrently process instructions from multiple threads to deliver a higher throughput. However, a reliability issue associated with this speedup is that the fine-grained resource sharing and the elevated hardware utilization in SMT processors may expose more program runtime state to neutron or alpha particle strikes at any given time, resulting in an increased microarchitecture susceptibility to transient faults. To characterize and understand hardware vulnerability to soft errors on SMT architectures, we have developed a reliability-aware SMT processor simulator. The framework provides cycle-accurate, microarchitecture-level reliability estimation for a parameterizable SMT architecture. Using a mixed set of SPEC CPU 2000 benchmarks, we quantify microarchitecture reliability efficiency on SMT architectures with different numbers of thread contexts and compare it with that of superscalar execution. Additionally, we examine the impact of various SMT fetch policies on microarchitecture reliability and further contrast reliability and performance tradeoffs across different workloads. The major observations and conclusions that can be drawn from our study are:
- For SMT architectures, shared microarchitecture structures are more susceptible to soft error strikes due to their higher resource utilization.
- Compared to superscalar execution, SMT can still achieve better performance and reliability tradeoffs.
- In spite of the increased overall microarchitecture vulnerability, SMT execution is likely to reduce the microarchitecture vulnerability of individual threads relative to superscalar execution.
© 2007 IEEE

- Fetch policies play an important role in determining SMT microarchitecture reliability-domain characteristics and efficiency. Compared to a SMT performance metric such as throughput, the reliability metric is more sensitive to the fetch policy. For example, SMT architectures employing two different fetch policies can result in similar throughput while exhibiting significantly different reliability behavior. Among all fetch policies we examined in this study, FLUSH and DWarn exhibit the most attractive behavior in both reliability and its tradeoff with performance. However, when fairness is considered, their advantages diminish.
The rest of the paper is organized as follows. Section 2 provides a brief background on the microarchitecture-level soft error vulnerability analysis methods we used in this study. Section 3 describes our simulation framework, methodology, and experimental setup. Section 4 presents a detailed microarchitecture soft error vulnerability profile of a SMT processor. Section 5 discusses the experimental results and highlights thread-aware reliability optimization opportunities. Section 6 summarizes related work. Section 7 concludes the paper and outlines our future work.

2. Microarchitecture Soft Error Vulnerability Analysis
Several techniques have been proposed to model processor vulnerability to soft errors at the microarchitecture level. In [7], Li and Adve estimate reliability using a probabilistic model of the error generation and propagation process in a processor. Statistical fault injection [3, 17] has also been used to evaluate architectural reliability. In this work, we estimate the reliability of SMT architectures using the Architectural Vulnerability Factor (AVF) analysis methods introduced in [4, 5]. In this section, we briefly describe the AVF computation methods to provide sufficient background for the rest of the paper.
A hardware structure's AVF refers to the probability that a transient fault in that hardware structure will result in incorrect program results. The overall soft error rate of a hardware structure is determined by two factors: the device raw error rate, mainly determined by circuit design and process technology, and the AVF. Since the hardware raw error rate does not change with code execution, the AVF can be used as a reliability metric to estimate how vulnerable the hardware is to soft errors during program execution. To compute the AVF, one needs to classify hardware states into bits that store information that can affect the final program output and those that cannot. The processor state bits required for architecturally correct execution (ACE) are called ACE bits [4]. In a given cycle, the AVF of a hardware structure is the percentage of ACE bits that the structure holds. The AVF of a hardware structure over a program run is derived by averaging the per-cycle AVFs across program execution. In practice, it is much more feasible to identify un-ACE bits, the processor state bits that do not affect correct program execution. Examples of locations that contain un-ACE bits include NOPs, idle or invalid states, uncommitted instructions, dynamically dead instructions and data, and cache lines that will not be accessed before eviction. A cycle-accurate, execution-driven performance model can be used to identify un-ACE bits and to track the residency cycles of un-ACE bits in hardware structures. Through cycle-level simulation, processor microarchitecture states are classified into ACE/un-ACE bits and their residency and resource usage counts are generated. This information is then used to compute the AVF of various hardware structures. To estimate the reliability of the entire processor, one can add the AVF values of all of the hardware structures together, weighting them by the number of bits within each structure.
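To make the bookkeeping concrete, the per-cycle AVF averaging described above can be sketched as follows. This is a toy illustration, not the paper's actual framework: the `CycleSample`/`structure_avf` names and the numbers are invented.

```python
# A minimal sketch of per-structure AVF accounting from a cycle-level trace.
# Illustrative only: names and toy numbers are assumptions, not the paper's
# simulator internals.
from dataclasses import dataclass

@dataclass
class CycleSample:
    ace_bits: int    # bits classified as ACE in this cycle
    total_bits: int  # total bits in the structure

def structure_avf(trace):
    """AVF = average over all cycles of (ACE bits / total bits)."""
    if not trace:
        return 0.0
    return sum(s.ace_bits / s.total_bits for s in trace) / len(trace)

# Toy trace for a 96-entry, 64-bit-per-entry issue queue over 4 cycles:
trace = [CycleSample(a, 96 * 64) for a in (1024, 2048, 512, 512)]
print(round(structure_avf(trace), 4))  # 0.1667
```

Whole-processor AVF would then be a bit-count-weighted average of such per-structure values, as the text notes.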
In this work, we focus on computing the reliability of individual microarchitecture structures, since we believe that a component-based vulnerability analysis provides readers better knowledge and insight on the impact of SMT techniques on processor reliability.

3. Experimental Setup
We have developed a framework that estimates the architectural and microarchitectural effects of soft errors on SMT architectures. Our reliability-aware SMT simulation framework is built on a heavily modified and extended M-Sim simulator [18], which models a detailed, execution-driven simultaneous multithreading processor with shared architecture components, such as the instruction queue, physical register file pool, function units and caches, as well as private structures for individual threads, including separate rename tables, program counters, reorder buffers, branch predictors and load/store queues. To model reliability, we implement and extend the AVF computation methods proposed in [4, 5] to support SMT architectures. As a result, our framework is capable of tracking the microarchitecture vulnerability contributed by each individual thread and summarizing the aggregated microarchitecture vulnerability due to multithreading. Our SMT reliability analysis framework covers a wide range of shared and non-shared microarchitecture components, including the instruction queue, register file, function units, reorder buffer, L1 data cache, TLB and load/store queue. At the end of each simulation, the framework outputs reliability estimates along with performance data. Table 1 shows the baseline machine configuration we used in this study. In a SMT processor, instruction fetch policies largely decide how processor resources are shared among threads and therefore play an important role in determining the overall performance of a SMT processor. We use ICOUNT [19] as the baseline fetch policy. The ICOUNT policy assigns the highest priority to the thread that has the fewest in-flight instructions.
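The ICOUNT priority computation amounts to a sort over per-thread in-flight instruction counts. A minimal sketch (the thread ids and counts are hypothetical, not taken from the simulator):

```python
# Sketch of ICOUNT fetch prioritization: the thread with the fewest
# instructions in the pre-issue pipeline stages fetches first.
# The dictionary of per-thread counts is an illustrative stand-in for the
# counters a real SMT front end would maintain.

def icount_priority(inflight):
    """Return thread ids ordered from highest to lowest fetch priority."""
    return sorted(inflight, key=inflight.get)

print(icount_priority({0: 34, 1: 12, 2: 5, 3: 20}))  # [2, 1, 3, 0]
```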
In Section 4.3, we further examine a set of advanced fetch policies such as STALL [20], DG [21], PDG [21], DWarn [22] and
FLUSH [20] to investigate the impact of fetch policies on SMT microarchitecture reliability.

Table 1. Simulated Machine Configuration
  Processor Width: 8-wide fetch/issue/commit
  Baseline Fetch Policy: ICOUNT
  Pipeline Depth: 7
  Issue Queue: 96 entries
  ITLB: 128 entries, 4-way, 200-cycle miss latency
  Branch Prediction: 2K-entry Gshare, 10-bit global history per thread
  BTB: 2K entries, 4-way, per thread
  Return Address Stack: 32 entries
  L1 Instruction Cache: 32KB, 2-way, 32 bytes/line, 2 ports, 1-cycle access
  ROB Size: 96 entries per thread
  Load/Store Queue: 48 entries per thread
  Integer ALU: 8 I-ALU, 4 I-MUL/DIV, 4 Load/Store
  FP ALU: 8 FP-ALU, 4 FP-MUL/DIV/SQRT
  DTLB: 256 entries, 4-way, 200-cycle miss latency
  L1 Data Cache: 64KB, 4-way, 64 bytes/line, 2 ports, 1-cycle access
  L2 Cache: unified, 2MB, 4-way, 128 bytes/line, 12-cycle access
  Memory Access: 64-bit wide, 200-cycle access latency

The SMT workloads in our experiments are comprised of SPEC CPU 2000 integer and floating point benchmarks. Because the characteristics of SMT workloads vary significantly and strongly depend on both the number of threads and the individual thread behavior, we create a set of SMT workloads with the number of threads ranging from 2 to 8 contexts and with the individual thread characteristics ranging from computation intensive to memory access intensive (see Table 2). We first categorize a SPEC benchmark as CPU intensive (CPU) or memory intensive (MEM) based on its IPC and cache miss rate after performing a simulation of 100M instructions from the selected execution point. The CPU and MEM workloads consist of programs drawn entirely from the CPU-intensive and memory-intensive groups respectively. Half of the programs in a SMT workload with mixed behavior (MIX) are selected from the CPU-intensive group and the rest are selected from the memory-intensive group. To ensure that our experimental results are not biased by a specific set of threads, we build two groups for each type of SMT workload and report average statistics wherever possible.
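The CPU/MEM categorization step can be sketched as below. The paper reports classifying by IPC and cache miss rate but not the exact cutoffs, so the thresholds here are invented placeholders:

```python
# Hypothetical classification of a benchmark as CPU- or memory-intensive
# from its profiled IPC and L2 miss rate; the threshold values are invented.

def classify(ipc, l2_miss_rate, ipc_cut=1.5, miss_cut=0.05):
    # Low IPC combined with a high miss rate marks a memory-bound thread.
    return "MEM" if ipc < ipc_cut and l2_miss_rate > miss_cut else "CPU"

print(classify(ipc=2.1, l2_miss_rate=0.01))  # CPU
print(classify(ipc=0.6, l2_miss_rate=0.12))  # MEM
```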
The only exception is the 8-context workloads, due to the insufficient number of programs that can be used to form two workload groups with enough diversity. We use the SimPoint tool to pick the most representative simulation point for each benchmark, and each benchmark is fast-forwarded to its representative point before detailed multithreaded simulation takes place. The simulations are terminated once the total number of simulated instructions reaches 50 million, 100 million and 200 million in 2-, 4-, and 8-context workloads respectively. Due to the high cost of AVF computation and the large number of simulations we perform in this study, we are unable to simulate a larger number of instructions.

Table 2. The Studied SMT Workloads
  2-Thread CPU: Group A: bzip2, eon; Group B: facerec, wupwise
  2-Thread MIX: Group A: eon, twolf; Group B: wupwise, equake
  2-Thread MEM: Group A: mcf, twolf; Group B: equake, vpr
  4-Thread CPU: Group A: bzip2, eon, gcc, perlbmk; Group B: perlbmk, mesa, facerec, wupwise
  4-Thread MIX: Group A: gcc, mcf, vpr, perlbmk; Group B: perlbmk, mesa, twolf, applu
  4-Thread MEM: Group A: mcf, equake, vpr, swim; Group B: twolf, galgel, applu, lucas
  8-Thread CPU: Group A: gap, bzip2, facerec, eon, mesa, perlbmk, parser, wupwise; Group B: gap, crafty, gcc, eon, mesa, perlbmk, fma3d, wupwise
  8-Thread MIX: Group A: perlbmk, mcf, mesa, swim, crafty, fma3d, equake, mgrid; Group B: bzip2, vpr, eon, lucas, applu, twolf, wupwise, perlbmk
  8-Thread MEM: mcf, twolf, swim, lucas, equake, applu, vpr, mgrid

We use several metrics to quantify the reliability-domain behavior of SMT architectures. The AVF is used as a baseline metric to estimate how susceptible a microarchitecture structure is to soft error strikes. However, raw AVF values are likely to be misleading, as they can be deflated by stretched execution cycles. In this study, we further use the Mean Instructions To Failure (MITF) metric to reason about the tradeoff between performance and reliability [6]. MITF represents the amount of work a processor can complete, on average, between two failures. A higher MITF implies a greater amount of work accomplished between errors, which is desirable.
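At a fixed frequency and raw error rate, MITF scales with IPC/AVF, so the tradeoff can be evaluated with a one-line helper. All numbers below are illustrative, not measured results from the paper:

```python
# Toy comparison of reliability efficiency (IPC/AVF, proportional to MITF
# at fixed frequency and raw error rate). The IPC and AVF values are
# invented for illustration.

def reliability_efficiency(ipc, avf):
    """Higher IPC/AVF = more work completed between soft errors."""
    return ipc / avf

smt = reliability_efficiency(ipc=3.2, avf=0.20)     # SMT: higher IPC and AVF
scalar = reliability_efficiency(ipc=1.4, avf=0.12)  # superscalar baseline
print(smt > scalar)  # True in this toy case: SMT wins the tradeoff
```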
At a fixed frequency and raw error rate, MITF is proportional to the ratio of IPC to AVF (IPC/AVF). Results throughout this paper are reported using both metrics.

4. Microarchitecture Soft Error Vulnerability of SMT Processors
This section presents a detailed study of microarchitecture vulnerability to soft errors on SMT processors. We start by profiling the reliability-domain characteristics across different microarchitecture structures in a SMT processor. We then contrast superscalar and SMT execution in terms of their reliability characteristics. Finally, we analyze how different fetch policies affect SMT architecture reliability-domain behavior.

4.1 SMT Microarchitecture Vulnerability Profile
We first set up a set of experiments to answer the following questions. (1) What is the microarchitecture reliability profile of the studied SMT processor? (2) Which SMT microarchitecture structures are more susceptible to soft error strikes? (3) Is SMT architecture beneficial to the reliability of individual threads as well as the whole multithreaded workload? (4) Do the above SMT microarchitecture reliability profiles change with different workload mixes?

Figure 1 shows the microarchitecture vulnerability profile of the studied SMT processor running 4-context multithreaded workloads. The microarchitecture structures are grouped together as shared pipeline structures, shared memory structures and non-shared structures. In general, the shared structures exhibit a higher AVF than the non-shared structures. Among the shared pipeline components, the instruction queue (IQ) and register files show higher vulnerability. This is because exploiting thread-level parallelism results in improved resource utilization on these structures. For example, by taking advantage of available idle entries in the IQ, more ACE bits are brought into the instruction queue waiting for execution, increasing the amount of time they are exposed to soft error strikes and thus making the IQ more susceptible to soft errors. A similar scenario occurs with the register files: instead of sitting idle, more physical registers in the pool are allocated. Surprisingly, among the shared memory structures, we find that the DL1 tag exhibits a higher vulnerability than the DL1 data array. Although intuitively the data array contains bits which are more critical to program execution, only a portion of the cache block is read and written upon a given memory reference. Soft error strikes on bits that are not accessed by the processor will not affect the program results. Unlike the data array, however, all of the tag bits are used to check for a match.

Figure 1. Microarchitecture Vulnerability Profile of the Studied SMT Processor (# of contexts = 4)

Compared to the CPU-bound workloads, the memory-bound workloads yield a higher AVF on microarchitecture structures that are used to extract instruction level parallelism (e.g. the IQ, register file, ROB and LSQ).
For example, memory-bound workloads increase the AVF of the IQ, register file, ROB and LSQ by 58%, 61%, 82% and 94% respectively. This is because the memory-bound workloads increase the cache miss rate. Upon a cache miss, the instruction experiencing the miss, as well as all instructions along its dependency chain, cannot be processed in the pipeline until the miss is served. The ACE bits contributed by these instructions will stall in the above structures for a long period of time, especially when an L2 cache miss occurs. The increased ACE bit residency increases the AVF of these structures significantly. On the other hand, the AVFs of the function unit (FU) and the DL1 data array are reduced in MEM workloads. The reduction in the function unit is due to the increased fraction of idle cycles caused by the diminished ILP in memory-bound workloads and stretched execution times. As for the DL1 data array, the memory-bound workloads generate more evictions and refills, which convert ACE cycles to un-ACE cycles. More specifically, in CPU workloads, accesses to the same cache block might not be clustered; there might be a long time between two cache hits on the same block. During this period, a soft error strike on a portion of the data within that block which will be accessed later can lead to an error. On MEM workloads, the increased number of cache misses can cause blocks to be evicted and then refilled more frequently. The increased cache line competition reduces the residency cycles during which a cache block is present, which in turn reduces the AVF of memory-related structures on MEM workloads.

Figure 2. Microarchitecture Reliability Efficiency (Measured as IPC/AVF) (# of contexts = 4)

Figure 2 shows microarchitecture reliability efficiency, as measured using the IPC/AVF metric, across different workload mixes.
As can be seen, the SMT microarchitecture yields the highest reliability efficiency on CPU-bound workloads, which means a greater amount of work can be completed between errors. This suggests that despite the higher resource utilization, which can increase the quantity of ACE bits stored within microarchitecture structures, the residency cycle reduction for ACE bits on high-ILP workloads results in an overall improvement in reliability efficiency. Having profiled the baseline SMT microarchitecture reliability characteristics, we set up a set of experiments to contrast the reliability of single thread execution with that of SMT execution. Our objective was to answer the following question: is SMT execution beneficial to the reliability of individual threads? In single thread execution, a thread can occupy all pipeline resources and therefore take advantage of exclusive access to all of the processor's resources to exploit maximum ILP. In SMT execution, a thread only uses a portion of all available resources and thus progresses more slowly. The number of ACE bits that each thread exposes to soft error strikes is correlated to the scale of its allocated resources, while the ACE bit residency time depends on the thread's execution speed. Since the overall AVF is governed by both of these factors, increasing with the quantity of ACE bits and decreasing as the residency time is decreased, there is no simple answer to this question. Figure 3 breaks down the IQ, ROB and FU AVFs of each individual thread in the SMT execution (4 contexts) and compares them to those of single thread execution. To allow for
direct comparison of results, we record the progress of each thread in the SMT execution and then simulate the same number of instructions for each thread in the single thread execution mode. This ensures that the work completed in sequential single thread execution is identical to that completed in concurrent SMT execution. Figure 3 shows that for individual threads, single thread execution yields higher microarchitecture vulnerability than the vulnerability contributed by the same thread in the SMT execution mode. For example, the IQ and ROB AVFs contributed by gcc drop by 91% and 87% respectively when it is paired with mcf, vpr, and perlbmk in SMT execution. This indicates that for individual threads, the increase in AVF due to fully occupying pipeline resources outweighs the decrease in AVF due to the acceleration in processing ACE bits.

Figure 3. Microarchitecture Vulnerability: SMT vs. Single Thread (ST) Execution

The aggregated AVF contributed by all of the threads in SMT execution exceeds the weighted AVF yielded by the microarchitecture when all of the threads are executed sequentially. The weighted AVF in sequential execution is derived using an individual thread's AVF weighted by the fraction of work that each thread completes. It is not surprising to observe a 2X increase in the IQ AVF on a 4-context SMT processor, where four CPU workload threads execute simultaneously and much more pressure is exerted on the IQ. Function units experience a similar trend, as exploiting both ILP and TLP can keep function units busy more frequently. The ROB AVF reduction in SMT execution is somewhat counterintuitive.
During SMT execution, instruction dispatching from the ROB to the IQ is more likely to be stalled due to competition for dispatch bandwidth and available IQ entries among different threads, resulting in additional dispatch delay which in turn increases the residency time of ACE bits in the ROB. However, there are additional factors that can affect the ROB AVF. A closer investigation indicates that the size of the register file limits ROB utilization during SMT execution. Since the register file is a shared resource and its size is fixed, the number of available registers that can be used by each thread for renaming diminishes with the number of threads. This in turn restricts the number of ROB entries allocated to each thread. This differs from single threaded execution, where a thread has exclusive access to all registers, allowing the allocation of more ROB entries. For our experiments, the 0.83 reduction in the registers available to each thread significantly reduces ROB utilization and AVF.

Figure 4. Microarchitecture Reliability Efficiency: SMT vs. Single Thread (ST) Execution

As stated in Section 3, the AVF can be used to reflect how vulnerable a hardware structure is to soft errors. However, raw AVF values can be misleading when used to compare the reliability of different design alternatives. Reliability efficiency offers insight into the tradeoffs between performance and reliability. Figure 4 shows the reliability efficiency of SMT execution and single thread execution. Several interesting observations can be made from our experimental results. First, the reliability efficiency of the FU for all threads is quite similar. This can be explained as follows: AVF is computed as the number of ACE bit residency cycles divided by the total number of execution cycles.
IPC includes the total number of execution cycles in its denominator as well. Therefore, the IPC/AVF metric cancels out the number of execution cycles and leaves only information on the amount of work that is completed. Because the same number of instructions is processed in SMT and stand-alone execution, IPC/AVF is the same for the FU. Second, we find that on CPU workloads, stand-alone execution generally yields better reliability efficiency on the IQ, whereas on MEM workloads, SMT execution can be beneficial to the IQ reliability efficiency of individual threads. However, this trend becomes unpredictable on MIX workloads. Essentially, threads in CPU workloads benefit from exclusive access to all resources in single threaded execution, although this increases the number of ACE bits brought into the IQ. This increase in AVF is offset by a relatively larger increase in performance, and thus the improved performance dominates the IPC/AVF results. While the overall throughput is increased during SMT execution, both the performance of each individual thread and the quantity of ACE bits contributed by individual threads are decreased, and thus both the IPC and the AVF are scaled down proportionally. On the MEM workloads, threads show high vulnerability during stand-alone execution and yet are unable to achieve much of an improvement in throughput in SMT execution due to the high cache miss rates across all of the threads. Reliability efficiency is therefore primarily determined by the increase in AVF. Overall, the IQ is more burdened in a SMT architecture than in a superscalar architecture, leading to a higher vulnerability. Comparing the overall AVF of multithreaded execution versus the aggregated AVF of
superscalar execution, we find that although the IPC/AVF metric is biased towards single threaded execution for some individual thread/structure cases, when considering the overall reliability efficiency of workloads, SMT architecture outperforms superscalar for all of the cases except the IQ on CPU workloads. This exception is due to the relatively large increase in AVF as compared to that of performance.

4.2 Microarchitecture Reliability vs. Number of Contexts
Figure 5 illustrates the SMT microarchitecture AVF trend as the number of thread contexts is increased. Shared structures such as the IQ show a steady increase in AVF as more threads are added. The AVF of the register file increases rapidly from 2-context workloads to 4-context workloads. A limited increase is observed when the number of contexts is further increased from 4 to 8, especially on MIX and MEM workloads. This is because as the number of contexts is increased, or as more threads become memory bound, an increased number of instructions can experience cache misses due to resource contention and poor cache behavior. These instructions, especially those experiencing L2 cache misses, can hold allocated registers for a long period of time. By further investigating the register life cycle, we found that registers remain in an allocated state without holding valid data until the write-back stage; soft error strikes during this period of time have no effect on the register values, since the registers will be overwritten during the write-back stage. So the time period from the renaming stage to the write-back stage is treated as an un-ACE period, meaning that it is not vulnerable. A long un-ACE period will reduce the fraction of time that a register holds valid data, leading to reduced vulnerability. Interestingly, we find that on MEM workloads, the DL1 data array AVF decreases as the number of contexts increases. This is because handling additional DL1 misses increases the frequency of evictions and updates in DL1 operations.
The ACE bit life cycle is therefore reduced. On the other hand, compared with MEM workloads, MIX workloads are able to deliver higher throughput and finish execution in a shorter period of time. Because of the reduced execution time, the percentage of ACE bits within the structure is larger at any given time in the MIX workloads, and this is reflected by a higher AVF in the MIX workloads than in the MEM workloads. For the DL1 data array, the increased AVF is caused by the increased number of threads, memory reference frequency and footprint. The FU AVF shows heterogeneous behavior when adding more thread contexts. On CPU workloads, increasing the number of thread contexts from 2 to 4 improves function unit utilization and AVF. However, when the number of contexts reaches 8, threads aggressively compete for resources, resulting in more contention and hence extended total execution time. By definition, AVF is determined by ACE bit residency time over total execution cycles; when the whole execution is stretched, the AVF is reduced. This observation holds valid on the MIX workloads. On MEM workloads, the limited ILP exploited across multiple threads does not reach the upper-bound processing capability of the pipeline; therefore, the FU AVF continuously increases as more threads are added.

Figure 5. Microarchitecture Vulnerability vs. Number of Contexts

4.3 Impact of Fetch Policy on SMT Microarchitecture Reliability
In the performance domain, the overall throughput that a SMT processor can deliver highly depends on the fetch scheme, which controls the quantity of instructions that are brought into the pipeline from multiple threads. From a reliability perspective, the fetch policy affects the quantity of ACE bits that can co-exist in a SMT processor pipeline as well as the speed at which the pipeline can process these ACE bits. In this section, we quantify the impact of various fetch policies on SMT microarchitecture reliability.
There has been abundant work in the literature on optimizing front-end SMT fetch mechanisms. From among those, we selected five fetch policies (i.e., STALL, DG, PDG, DWarn and FLUSH) to use in our study. These five fetch policies differ from each other in how they react to long-latency instructions. Long-latency instructions can significantly increase the residency cycles within microarchitecture structures and thereby increase the AVF. FLUSH frees up pipeline resources for threads suffering L2 cache misses by squashing the dispatched instructions from the offending threads. There are several alternative schemes to determine when to flush the pipeline; we implement a policy that flushes the pipeline from the first instruction following a cache miss instruction. STALL prevents instruction fetching for threads that have L2 cache misses but always allows at least one thread to continue fetching instructions. DG and PDG stop fetching once a thread has several outstanding L1 cache miss instructions. The difference between DG and PDG is that PDG predicts L1 cache misses to minimize the delay of decision making. Instead of stopping instruction fetch, the DWarn policy assigns lower fetch priority to threads with outstanding data cache misses. Figure 6 shows the AVF of the SMT microarchitecture running 4-context and 8-context workloads under different fetch policies. Among all of the fetch policies we examined, FLUSH shows the most distinct behavior. The AVF of both shared and non-shared microarchitecture structures (e.g. the IQ, ROB and LSQ) reduces significantly when the FLUSH policy is applied. For example, the

7 IQ, ROB and LSQ under the fetch policy are only about 5 of the under other fetch policies. The reason for this is, upon a L2 cache miss occurrence, the ACE bits contributed by instructions which have a dependence on the cache miss will remain in these structures for at least hundreds of cycles. Even other instructions independent of them will experience a long delay because of the blocked commitment. By flushing these instructions from the offending thread, the ACE bit residency cycles and the average exposure time of ACE bits to soft error strikes is reduced significantly. Interestingly, we find that the fetch policy can increase the of the function unit and data cache. This is because additional resources freed by can speed up the execution of non-offending threads, which in turn leads to increased function unit utilization and frequent accesses to cache blocks, resulting in more bits within the cache that are vulnerable. The impact of the policy on SMT microarchitecture varies with the number and characteristics of threads. On 8 context workloads where the SMT processor yields more cache misses due to resource contention, shows a more distinct behavior than other fetch policies. Similarly, on memorybounded workloads which have larger memory footprints, reduces more noticeably. Interestingly, we observe that on a 4-context CPU workload, the policy results in an increase of. As we discussed earlier, the reduced number of execution cycles increases on the 4-context CPU workload. 5 45% 35% 3 25% 15% 1 5% P P P P P P CPU P P IQ ROB FU Reg LSQ_data LSQ_tag DL1_data DL1-tag Fetch policies such as,, P and DWarn respond to cache misses by either blocking instruction fetch from the offending threads or assigning a lower priority to those threads. Compared with the baseline, they can alleviate microarchitecture vulnerability in many cases. Nevertheless, they still exhibit higher than. 
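To make the policy distinctions concrete, the toy dispatcher below sketches how each policy reacts to one thread's miss state. The policy set matches the ones studied here, but the decision logic and the two-outstanding-L1-miss threshold for DG/PDG are illustrative assumptions, not the authors' implementation:

```python
# Simplified per-thread fetch decision for each policy (a sketch only).
def fetch_decision(policy, t):
    """t: per-thread miss state. Returns 'fetch', 'stall',
    'deprioritize', or 'flush'."""
    if policy == "ICOUNT":
        return "fetch"                    # baseline: ignores cache misses
    if policy == "FLUSH" and t["l2_miss"]:
        return "flush"                    # squash the offending thread
    if policy == "STALL" and t["l2_miss"]:
        return "stall"                    # stop fetching for this thread
    if policy == "DG" and t["l1_misses"] >= 2:
        return "stall"                    # gate on outstanding L1 misses
    if policy == "PDG" and (t["l1_misses"] >= 2 or t["l1_miss_predicted"]):
        return "stall"                    # gate earlier, via prediction
    if policy == "DWarn" and (t["l1_misses"] > 0 or t["l2_miss"]):
        return "deprioritize"             # keep fetching, lower priority
    return "fetch"

t = {"l2_miss": True, "l1_misses": 2, "l1_miss_predicted": True}
for p in ["ICOUNT", "FLUSH", "STALL", "DG", "PDG", "DWarn"]:
    print(p, fetch_decision(p, t))
```

The key asymmetry the results hinge on is visible here: only FLUSH removes ACE bits already in the pipeline; the other four merely throttle what enters next.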
These schemes are incapable of handling instructions that have already been brought into the pipeline. Consequently, more ACE bits are brought into the pipeline and stay in microarchitecture structures for the next several hundred cycles, contributing to a high AVF. In Figure 6, we find that STALL barely affects the IQ AVF in the 4-context workload compared with the baseline, while it becomes quite effective in the 8-context workload. The reason is that resource contention in the 8-context workload results in a higher cache miss rate. Comparing the AVF of STALL with that of DG and PDG, STALL outperforms the latter two. The difference arises primarily because STALL is more responsive to L2 cache misses, which have a more significant impact on AVF. By only monitoring outstanding L1 cache misses, DG and PDG are limited in their ability to respond to L2 cache misses, and hence the effectiveness of their AVF reduction is restricted. Another interesting finding is that the advanced fetch policies usually achieve a lower AVF than the ICOUNT scheme on the MEM and MIX workloads. Conversely, on the 4-context CPU workloads, the advanced fetch policies can exhibit higher AVF.

(a) 4 Contexts (b) 8 Contexts
Figure 6. Microarchitecture AVF under Different Fetch Policies

Accounting for both reliability and performance, we present the IPC/AVF metric in Figure 7 for the five advanced fetch policies. All data shown in Figure 7 are normalized to the ICOUNT baseline results. A higher IPC/AVF means that a greater amount of work can be completed between error occurrences and therefore suggests a better reliability/performance tradeoff. Mechanisms that reduce both the AVF and the IPC may be worthwhile only if they increase the MITF. As shown in Figure 7, FLUSH does the best overall concerning both reliability and performance. STALL also shows the potential to achieve a desirable reliability-performance tradeoff. The reliability efficiency of the FLUSH policy reaches its maximum value on the MEM workloads.
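The remark about MITF can be made precise: for a structure with a fixed per-bit raw error rate, the mean instructions to failure is proportional to IPC/AVF, so a policy that costs some IPC can still improve MITF if it cuts AVF enough. A sketch with illustrative numbers (not data from this study):

```python
# Mean instructions to failure: committed instructions per soft error.
# MITF = (IPC * cycles) / (rate * bits * AVF * cycles)
#      = IPC / (rate * bits * AVF), hence proportional to IPC/AVF.
def mitf(ipc, avf, bits, raw_err_per_bit_cycle):
    return ipc / (avf * bits * raw_err_per_bit_cycle)

base = mitf(ipc=1.0, avf=0.20, bits=2048, raw_err_per_bit_cycle=1e-15)
new = mitf(ipc=0.9, avf=0.10, bits=2048, raw_err_per_bit_cycle=1e-15)
print(new / base)  # ≈ 1.8: losing 10% IPC while halving AVF still raises MITF
```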
The reason is that the ICOUNT baseline is worse than FLUSH in both throughput and reliability due to the memory-bound threads: they penalize performance by clogging the IQ and increase AVF by bringing in long-latency ACE bits. Contrary to ICOUNT, the FLUSH policy handles memory-bound threads better by solving both problems, and it achieves an improvement in both throughput and reliability. However, FLUSH fails to outperform ICOUNT on the CPU workload due to the small number of cache misses present within that workload; throughput and vulnerability can only benefit slightly from FLUSH. On the MIX workload, AVF is reduced substantially along with a slight performance improvement. On CPU workloads, due to the low cache miss frequency, the difference between an advanced fetch policy and the ICOUNT baseline scheme diminishes. As the number of thread contexts increases from 4 to 8, more cache misses occur on CPU-bound workloads; as a result, the FLUSH policy starts to manifest better reliability efficiency. When all threads become memory bound, no thread can make progress under the FLUSH policy, and therefore the reliability efficiency drops again. Figure 7 shows that, overall, FLUSH and STALL show better reliability efficiency than the other examined fetch policies.

Although throughput represents the overall performance, it does not reflect the progress made by each individual thread, and hence fairness across all threads is not considered. To address this, we use weighted speedup and harmonic IPC as proposed in [14, 15]. Figure 8 shows the performance and reliability tradeoff evaluation using these alternative performance metrics. The weighted speedup is obtained by normalizing each thread's performance in SMT to that in single-thread execution and adding the results together; the resulting value reflects the effective throughput of the workload compared to single-thread execution. The harmonic IPC calculates the harmonic mean of the weighted per-thread IPCs to quantify both performance and fairness. There are several interesting observations when different performance/AVF metrics are used to quantify the reliability efficiency. As shown in Figure 7, when IPC/AVF is used, FLUSH always outperforms the other fetch schemes across all microarchitecture structures. When tradeoffs are measured with weighted speedup, as shown in Figure 8(a), the difference between FLUSH and the others diminishes. When harmonic IPC is used, DWarn becomes a promising mechanism to achieve a good performance and reliability tradeoff on microarchitecture components such as the FU, DL1, and register file.
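The two fairness-aware metrics from [14, 15] are easy to state precisely. Below is a small sketch; the per-thread IPC values are illustrative, not results from this study:

```python
def weighted_speedup(smt_ipc, alone_ipc):
    """Sum of each thread's SMT IPC relative to its single-thread IPC."""
    return sum(s / a for s, a in zip(smt_ipc, alone_ipc))

def harmonic_ipc(smt_ipc, alone_ipc):
    """Harmonic mean of the relative IPCs; a starved thread drags it down."""
    n = len(smt_ipc)
    return n / sum(a / s for s, a in zip(smt_ipc, alone_ipc))

smt = [1.2, 0.3]    # per-thread IPC when co-scheduled
alone = [2.0, 1.0]  # per-thread IPC when each runs alone
print(weighted_speedup(smt, alone))  # ≈ 0.9
print(harmonic_ipc(smt, alone))      # ≈ 0.4: the starved thread dominates
```

Because the harmonic mean is dominated by the slowest relative thread, a policy like FLUSH that starves offending threads scores worse under harmonic IPC than under raw throughput, matching the observations above.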
The reason is that, in contrast to DWarn, the FLUSH policy does not yield a significant reliability benefit on the FU, DL1, DTLB and register file. Moreover, FLUSH favors the execution progress of threads with low L2 cache miss rates and thus can penalize the weighted speedup and harmonic mean IPC. As a result, DWarn is the best choice for balancing reliability, throughput and fairness on those hardware structures. On the other hand, FLUSH remains the best choice for the IQ, ROB and LSQ even when harmonic IPC is used. This is due to the significant AVF reduction on these structures when the FLUSH policy is applied. Although the average harmonic IPC yielded by FLUSH on the 4-context and 8-context workloads is 16% lower than that yielded by DWarn, the AVF reduction achieved by FLUSH is considerably larger on average. On these structures, the benefit due to AVF reduction outweighs the loss in fairness due to unbalanced resource allocation.

Figure 7. A Comparison of Reliability Efficiency of Different Fetch Policies

Figure 8(a) and (b). A Comparison of Reliability Efficiency using Different Performance Metrics

5. Discussion of Experiment Results

The extensive simulation results we present in Section 4 reveal the challenges and opportunities in optimizing reliability for SMT microarchitectures. For example, our study shows that the highest vulnerability is likely to occur in the shared structures. To avoid vulnerability hotspots in their designs, architects need to first focus on protecting shared SMT microarchitecture structures from soft error strikes. Among the shared microarchitecture components, the IQ and register file stand out with respect to vulnerability, and the IQ particularly stands out in scenarios with many active threads. For single-thread execution, several architectural-level techniques [6] have been proposed to alleviate potential soft error strikes on vulnerable program and microarchitecture states. In [6], long-latency instructions are flushed to maintain a low IQ AVF. Our experimental results show that although FLUSH can effectively reduce the vulnerability of an offending thread on an SMT architecture, it does not ensure that the newly freed resources are allocated to other threads in a manner that is always favorable to reliability. Our study indicates that, in addition to coping with cache misses, limiting the scale of resources that can be shared by multiple threads can also contribute to AVF reduction. In SMT processors, ILP as well as TLP is limited by program properties and available threads. By increasing the size of microarchitecture structures, architects aim to exploit more parallelism. Nevertheless, the performance gain does not scale linearly with the amount of hardware resources. This effect, on the other hand, has a great influence on reliability, because the increased size of a microarchitecture structure is likely to bring in more in-flight instructions and expose more program state to soft-error strikes. Reliability-aware resource allocation avoids resource abuse by threads with a high fraction of ACE bits within the pipeline.
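One way to sketch such reliability-aware allocation at the fetch stage is to demote threads whose estimated in-flight ACE-bit count exceeds a budget, falling back to ICOUNT ordering otherwise. Everything below (the budget value, the ThreadState fields) is a hypothetical illustration, not the authors' design:

```python
from dataclasses import dataclass

@dataclass
class ThreadState:
    tid: int
    ace_bits_in_flight: int  # estimated ACE bits this thread holds in the pipeline
    icount: int              # in-flight instruction count (the ICOUNT input)

def fetch_priority(threads, ace_budget=4096):
    """Order threads for fetch: ICOUNT ordering, but threads over the
    ACE-bit budget are demoted, bounding the AVF they can contribute."""
    under = [t for t in threads if t.ace_bits_in_flight <= ace_budget]
    over = [t for t in threads if t.ace_bits_in_flight > ace_budget]
    by_icount = lambda t: t.icount  # fewest in-flight instructions first
    return sorted(under, key=by_icount) + sorted(over, key=by_icount)

threads = [ThreadState(0, 6000, 10), ThreadState(1, 1000, 30), ThreadState(2, 800, 5)]
print([t.tid for t in fetch_priority(threads)])  # [2, 1, 0]: thread 0 is demoted
```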
For example, a thread's instructions can form a long dependence chain and clog the IQ. In this case, reliability-aware resource allocation, such as enforcing predefined static IQ partitions for each thread, could help with vulnerability reduction. As a second example, reliability-aware fetch throttling, built on top of existing fetch schemes and extended with reliability awareness of individual threads, can be used to maintain a low AVF while achieving high throughput. By comparing results that measure IPC/AVF, weighted speedup/AVF and harmonic IPC/AVF, we find that the reliability efficiency yielded by FLUSH diminishes when fairness is considered. STALL exhibits the potential to reduce vulnerability while maintaining fairness. However, our analysis indicates that a limitation of STALL is that it stalls fetching for the offending thread only upon detecting an L2 cache miss; this delay can introduce a significant number of ACE bits into the pipeline. If the L2 cache miss could be predicted when the offending instruction enters the pipeline, fetch could be stalled immediately to ensure that no further ACE bits are brought into the pipeline. By incorporating an L2 cache miss prediction mechanism, STALL can be further enhanced to reduce vulnerability while maintaining fairness.

6. Related Work

There is a growing amount of work aimed at characterizing soft error behavior at the microarchitecture level. In [2, 3], detailed processor RTL models were used to estimate microarchitecture reliability. The RTL models contain all of the detailed information about the microprocessors. However, the simulation slowdown of RTL models is too expensive for architecture studies, in which the tradeoffs between many hardware configurations need to be considered. Moreover, these models are generally not available during the architectural exploration phase of a microprocessor design. In [4, 5], Mukherjee et al. introduced the concept of architecturally correct execution (ACE) bits to compute the AVF of microarchitecture structures using a performance model. The vulnerability of the hardware structures (e.g. instruction queue, execution unit, TLB and caches) of an Itanium2-like IA64 processor was studied in [4, 5]. To reduce instruction queue AVF, Weaver et al. [6] proposed selectively squashing instructions when long delays are encountered; they examine cache-miss triggers and squashing actions to remove existing instructions from the instruction queue. In [7], Li and Adve developed SoftArch, an architecture-level tool for modeling and analyzing soft errors. The SoftArch framework estimates reliability using a probabilistic model of the error generation and propagation process in a processor. In [8], microarchitecture vulnerability phase behavior during program execution is observed and its predictability is studied. As a complementary approach to AVF computation, statistical fault injection has been used in several studies [2, 3, 4] to evaluate architectural reliability; to obtain statistical significance, a large number of experiments need to be performed on the investigated hardware component. To our knowledge, microarchitecture soft error vulnerability analysis has so far been exclusively focused on single-thread execution environments. In the past, there have been numerous studies on performance characterization and optimization for SMT architectures. Power and thermal issues in SMT architecture design have also been studied recently in [23]. In [24, 25], SMT architectures were used to perform redundant thread execution for transient fault detection and recovery. Nevertheless, the implications of simultaneous multithreading for hardware reliability and the reliability efficiency of SMT architectures remain unexplored.

7. Conclusions

The use of simultaneous multithreading techniques enhances overall system performance but also raises questions about susceptibility to soft error strikes.
In this paper, we provide an in-depth analysis of the impact of multithreading on processor vulnerability to transient faults. We extend an SMT performance simulator with microarchitecture vulnerability computation models. Using programs from the SPEC CPU 2000 benchmark suite, we analyze the microarchitecture vulnerability of representative program mixes. Our major conclusions from this work are: (1) In general, the use of SMT techniques increases the vulnerability of shared microarchitecture structures, and the degree of vulnerability increases with the number of running threads. Our experimental results indicate that the IQ and register file are more susceptible to soft errors than the other structures we studied. To avoid vulnerability hotspots in their designs, architects need to first focus on protecting those shared SMT microarchitecture structures from soft error strikes. (2) Comparing the reliability of SMT to superscalar processors, multithreaded architectures exhibit an increased overall microarchitecture vulnerability, although they are likely to reduce the microarchitecture vulnerability of individual threads. (3) The microarchitecture vulnerability is sensitive to instruction fetch policies. Among all the fetch policies we investigated, FLUSH and STALL are the most attractive schemes for reducing vulnerability while maintaining good reliability efficiency; however, the advantage diminishes when fairness is taken into consideration. By analyzing the experimental results, we have gained insight into the SMT microarchitecture vulnerability profile and how this profile changes with workload behavior, the number of threads and fetch policies. We point out several optimization opportunities to improve reliability, such as enhancing the fetch policy with predicted cache misses to proactively stop fetching, or dynamically distributing resources among threads based on their vulnerability profiles. We plan to explore thread-aware reliability optimization techniques for SMT architectures in our future work.
Acknowledgment

This research is partially supported by a Microsoft Research Trustworthy Computing award and by NASA award no. NCC.

References

[1] P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi, Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic, In Proceedings of the International Conference on Dependable Systems and Networks.
[2] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S. Mukherjee, Design and Evaluation of Hybrid Fault-Detection Systems, In Proceedings of the International Symposium on Computer Architecture.
[3] N. J. Wang, J. Quek, T. M. Rafacz, and S. J. Patel, Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline, In Proceedings of the International Conference on Dependable Systems and Networks.
[4] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor, In Proceedings of the International Symposium on Microarchitecture.
[5] A. Biswas, R. Cheveresan, J. Emer, S. S. Mukherjee, P. B. Racunas, and R. Rangan, Computing Architectural Vulnerability Factors for Address-Based Structures, In Proceedings of the International Symposium on Computer Architecture.
[6] C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor, In Proceedings of the International Symposium on Computer Architecture.
[7] X. D. Li, S. V. Adve, P. Bose, and J. A. Rivers, SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors, In Proceedings of the International Conference on Dependable Systems and Networks.
[8] X. Fu, J. Poe, T. Li, and J. Fortes, Characterizing Microarchitecture Soft Error Vulnerability Phase Behavior, In Proceedings of the International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2006.
[9] D. Tullsen, S. Eggers, and H. Levy, Simultaneous Multithreading: Maximizing On-Chip Parallelism, In Proceedings of the International Symposium on Computer Architecture.
[10] P. Kongetira, K. Aingaran, and K. Olukotun, Niagara: A 32-Way Multithreaded Sparc Processor, IEEE Micro, vol. 25, no. 2, Mar/Apr.
[11] D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton, Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal, 6(1), Feb.
[12] J. Seng, D. Tullsen, and G. Cai, Power-Sensitive Multithreaded Architecture, In Proceedings of the International Conference on Computer Design.
[13] J. Hasan, A. Jalote, T. N. Vijaykumar, and C. Brodley, Heat Stroke: Power-Density-Based Denial of Service in SMT, In Proceedings of the International Symposium on High-Performance Computer Architecture.
[14] K. Luo, J. Gummaraju, and M. Franklin, Balancing Throughput and Fairness in SMT Processors, In Proceedings of the International Symposium on Performance Analysis of Systems and Software.
[15] S. Raasch and S. Reinhardt, The Impact of Resource Partitioning on SMT Processors, In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques.
[16] F. Cazorla, E. Fernandez, A. Ramirez, and M. Valero, Dynamically Controlled Resource Allocation in SMT Processors, In Proceedings of the International Symposium on Microarchitecture.
[17] E. W. Czeck and D. Siewiorek, Effects of Transient Gate-level Faults on Program Behavior, In Proceedings of the International Symposium on Fault-Tolerant Computing.
[18] Joseph Sharkey, M-Sim: A Flexible, Multithreaded Architectural Simulation Environment, Technical Report CS-TR-05-DP01, Department of Computer Science, State University of New York at Binghamton.
[19] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, In Proceedings of the International Symposium on Computer Architecture.
[20] D. Tullsen and J. Brown, Handling Long-latency Loads in a Simultaneous Multithreading Processor, In Proceedings of the International Symposium on Microarchitecture.
[21] A. El-Moursy and D. H. Albonesi, Front-end Policies for Improved Issue Efficiency in SMT Processors, In Proceedings of the International Symposium on High Performance Computer Architecture.
[22] F. J. Cazorla, E. Fernandez, A. Ramirez, and M. Valero, DCache Warn: an I-Fetch Policy to Increase SMT Efficiency, In Proceedings of the International Parallel and Distributed Processing Symposium.
[23] J. Hasan et al., Heat Stroke: Power-Density-Based Denial of Service in SMT, In Proceedings of the International Symposium on High-Performance Computer Architecture.
[24] T. N. Vijaykumar, I. Pomeranz, and K. Cheng, Transient-Fault Recovery via Simultaneous Multithreading, In Proceedings of the International Symposium on Computer Architecture.
[25] S. Reinhardt and S. Mukherjee, Transient Fault Detection via Simultaneous Multithreading, In Proceedings of the International Symposium on Computer Architecture.


Adaptive Cache Memories for SMT Processors

Adaptive Cache Memories for SMT Processors Adaptive Cache Memories for SMT Processors Sonia Lopez, Oscar Garnica, David H. Albonesi, Steven Dropsho, Juan Lanchares and Jose I. Hidalgo Department of Computer Engineering, Rochester Institute of Technology,

More information

ECE404 Term Project Sentinel Thread

ECE404 Term Project Sentinel Thread ECE404 Term Project Sentinel Thread Alok Garg Department of Electrical and Computer Engineering, University of Rochester 1 Introduction Performance degrading events like branch mispredictions and cache

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor Boost Sequential Program Performance Using A Virtual Large Instruction Window on Chip Multicore Processor Liqiang He Inner Mongolia University Huhhot, Inner Mongolia 010021 P.R.China liqiang@imu.edu.cn

More information

One-Level Cache Memory Design for Scalable SMT Architectures

One-Level Cache Memory Design for Scalable SMT Architectures One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract

More information

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs International Journal of Computer Systems (ISSN: 2394-1065), Volume 04 Issue 04, April, 2017 Available at http://www.ijcsonline.com/ An Intelligent Fetching algorithm For Efficient Physical Register File

More information

PROBABILITY THAT A FAULT WILL CAUSE A DECLARED ERROR. THE FIRST

PROBABILITY THAT A FAULT WILL CAUSE A DECLARED ERROR. THE FIRST REDUCING THE SOFT-ERROR RATE OF A HIGH-PERFORMANCE MICROPROCESSOR UNLIKE TRADITIONAL APPROACHES, WHICH FOCUS ON DETECTING AND RECOVERING FROM FAULTS, THE TECHNIQUES INTRODUCED HERE REDUCE THE PROBABILITY

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

Reliability in the Shadow of Long-Stall Instructions

Reliability in the Shadow of Long-Stall Instructions Reliability in the Shadow of Long-Stall Instructions Vilas Sridharan David Kaeli ECE Department Northeastern University Boston, MA 2115 {vilas, kaeli}@ece.neu.edu Arijit Biswas FACT Group Intel Corporation

More information

Performance Oriented Prefetching Enhancements Using Commit Stalls

Performance Oriented Prefetching Enhancements Using Commit Stalls Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of

More information

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department of Computer Science State University of New York

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

High Performance Memory Requests Scheduling Technique for Multicore Processors

High Performance Memory Requests Scheduling Technique for Multicore Processors High Performance Memory Requests Scheduling Technique for Multicore Processors Walid El-Reedy Electronics and Comm. Engineering Cairo University, Cairo, Egypt walid.elreedy@gmail.com Ali A. El-Moursy Electrical

More information

Eliminating Microarchitectural Dependency from Architectural Vulnerability

Eliminating Microarchitectural Dependency from Architectural Vulnerability Eliminating Microarchitectural Dependency from Architectural Vulnerability Vilas Sridharan and David R. Kaeli Department of Electrical and Computer Engineering Northeastern University {vilas, kaeli}@ece.neu.edu

More information

Using Hardware Vulnerability Factors to Enhance AVF Analysis

Using Hardware Vulnerability Factors to Enhance AVF Analysis Using Hardware Vulnerability Factors to Enhance AVF Analysis Vilas Sridharan and David R. Kaeli ECE Department Northeastern University Boston, MA 02115 {vilas, kaeli}@ece.neu.edu ABSTRACT Fault tolerance

More information

Dynamic Capacity-Speed Tradeoffs in SMT Processor Caches

Dynamic Capacity-Speed Tradeoffs in SMT Processor Caches Dynamic Capacity-Speed Tradeoffs in SMT Processor Caches Sonia López 1, Steve Dropsho 2, David H. Albonesi 3, Oscar Garnica 1, and Juan Lanchares 1 1 Departamento de Arquitectura de Computadores y Automatica,

More information

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Aneesh Aggarwal Electrical and Computer Engineering Binghamton University Binghamton, NY 1392 aneesh@binghamton.edu Abstract With the

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1) 1 Problem 3 Consider the following LSQ and when operands are available. Estimate

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading CSE 502 Graduate Computer Architecture Lec 11 Simultaneous Multithreading Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson,

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance

More information

Boosting SMT Performance by Speculation Control

Boosting SMT Performance by Speculation Control Boosting SMT Performance by Speculation Control Kun Luo Manoj Franklin ECE Department University of Maryland College Park, MD 7, USA fkunluo, manojg@eng.umd.edu Shubhendu S. Mukherjee 33 South St, SHR3-/R

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 s Joshua J. Yi and David J. Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are

More information

Reducing Reorder Buffer Complexity Through Selective Operand Caching

Reducing Reorder Buffer Complexity Through Selective Operand Caching Appears in the Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003 Reducing Reorder Buffer Complexity Through Selective Operand Caching Gurhan Kucuk Dmitry Ponomarev

More information

Power-Efficient Approaches to Reliability. Abstract

Power-Efficient Approaches to Reliability. Abstract Power-Efficient Approaches to Reliability Niti Madan, Rajeev Balasubramonian UUCS-05-010 School of Computing University of Utah Salt Lake City, UT 84112 USA December 2, 2005 Abstract Radiation-induced

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

Simultaneous Multithreading Architecture

Simultaneous Multithreading Architecture Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.

More information

Cache Implications of Aggressively Pipelined High Performance Microprocessors

Cache Implications of Aggressively Pipelined High Performance Microprocessors Cache Implications of Aggressively Pipelined High Performance Microprocessors Timothy J. Dysart, Branden J. Moore, Lambert Schaelicke, Peter M. Kogge Department of Computer Science and Engineering University

More information

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Aalborg Universitet Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Publication date: 2006 Document Version Early version, also known as pre-print

More information

Exploiting Value Prediction for Fault Tolerance

Exploiting Value Prediction for Fault Tolerance Appears in Proceedings of the 3rd Workshop on Dependable Architectures, Lake Como, Italy. Nov. 28. Exploiting Value Prediction for Fault Tolerance Xuanhua Li and Donald Yeung Department of Electrical and

More information

Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC

More information

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University {parki,vijay}@ecn.purdue.edu http://min.ecn.purdue.edu/~parki http://www.ece.purdue.edu/~vijay Abstract

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt Department of Electrical and Computer Engineering The University

More information

Applications of Thread Prioritization in SMT Processors

Applications of Thread Prioritization in SMT Processors Applications of Thread Prioritization in SMT Processors Steven E. Raasch & Steven K. Reinhardt Electrical Engineering and Computer Science Department The University of Michigan 1301 Beal Avenue Ann Arbor,

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors

Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors R. Ubal, J. Sahuquillo, S. Petit and P. López Department of Computing Engineering (DISCA) Universidad Politécnica de Valencia,

More information

Reliable Architectures

Reliable Architectures 6.823, L24-1 Reliable Architectures Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 6.823, L24-2 Strike Changes State of a Single Bit 10 6.823, L24-3 Impact

More information

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Keerthi Bhushan Rajesh K Chaurasia Hewlett-Packard India Software Operations 29, Cunningham Road Bangalore 560 052 India +91-80-2251554

More information

SEVERAL studies have proposed methods to exploit more

SEVERAL studies have proposed methods to exploit more IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying

More information

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Dan Doucette School of Computing Science Simon Fraser University Email: ddoucett@cs.sfu.ca Alexandra Fedorova

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Appears in the Proceedings of the 30 th Annual International Symposium on Computer Architecture (ISCA) Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) #1 Lec # 2 Fall 2003 9-10-2003 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors In Proceedings of the th International Symposium on High Performance Computer Architecture (HPCA), Madrid, February A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

More information

Multithreaded Value Prediction

Multithreaded Value Prediction Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors : Optimizations for Improving Fetch Bandwidth of Future Itanium Processors Marsha Eng, Hong Wang, Perry Wang Alex Ramirez, Jim Fung, and John Shen Overview Applications of for Itanium Improving fetch bandwidth

More information

(big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing.

(big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing. (big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing. Intro: CMP with MT cores e.g. POWER5, Niagara 1 & 2, Nehalem Off-chip miss

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

Checker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India

Checker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India Advanced Department of Computer Science Indian Institute of Technology New Delhi, India Outline Introduction Advanced 1 Introduction 2 Checker Pipeline Checking Mechanism 3 Advanced Core Checker L1 Failure

More information