An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures


Wangyuan Zhang, Xin Fu, Tao Li and José Fortes
Department of Electrical and Computer Engineering, University of Florida

Abstract
Semiconductor transient faults (i.e. soft errors) have become an increasingly important threat to microprocessor reliability. Simultaneous multithreaded (SMT) architectures exploit thread-level parallelism to improve overall processor throughput. A great amount of research has been conducted in the past to investigate performance and power issues of SMT architectures. Nevertheless, the effect of multithreaded execution on a microarchitecture's vulnerability to soft errors remains largely unexplored. To address this issue, we have developed a microarchitecture-level soft error vulnerability analysis framework for SMT architectures. Using a mixed set of SPEC CPU 2000 benchmarks, we quantify the impact of multithreading on a wide range of microarchitecture structures. We examine how the baseline SMT microarchitecture reliability profile varies with workload behavior, the number of threads and fetch policies. Our experimental results show that the overall vulnerability rises in multithreading architectures, while each individual thread shows less vulnerability. When both performance and reliability are considered, SMT outperforms superscalar architectures. SMT reliability and its tradeoff with performance vary across different fetch policies. With a detailed analysis of the experimental results, we point out a set of potential opportunities to reduce SMT microarchitecture vulnerability, which can serve as guidance for developing thread-aware reliability optimization techniques in the near future. To our knowledge, this paper presents the first effort to characterize microarchitecture vulnerability to soft errors on SMT processors.

1. Introduction
Semiconductor transient faults (i.e.
soft errors) have become an increasingly important threat to microprocessor reliability. Transient faults, also known as soft errors, are caused by cosmic rays or substrate alpha particles that can potentially corrupt program data. With the advance of VLSI technologies, the next generation of microprocessors is projected to be more susceptible to soft error strikes due to the continuously reduced feature size and supply voltage, and increasing clock frequency and on-chip density [1]. The rapidly increasing soft error rates and the advent of billion transistor chips suggest that it will be infeasible to protect every transistor from soft error strikes. At the microarchitecture level, a significant fraction of soft errors can be efficiently masked. Motivated by this observation, a growing number of studies [1, 2, 3, 4, 5, 6, 7, 8] have focused on characterizing microarchitecture soft error behavior. These studies, however, exclusively focus on single thread execution environments. Due to the diminishing instruction level parallelism (ILP) performance gains available on wider issue superscalar processors, simultaneous multithreaded (SMT) architectures [9] have been proposed and used in commercial processors [10, 11] to exploit thread-level parallelism. The SMT architectures improve the performance of a superscalar processor by dynamically sharing pipeline and microarchitecture resources among multiple, concurrently running threads. In the past, the performance and power issues of SMT architectures have been extensively studied [12, 13, 14, 15, 16]. As soft errors continue to become an increasing threat to hardware reliability, it is important to characterize and understand the impact of simultaneous multithreading on processor dependability. The effects of multithreading on reliability can be two-fold. 
Compared with a superscalar processor on which threads are executed sequentially, a SMT processor can concurrently process instructions from multiple threads to deliver a higher throughput. However, a reliability issue associated with this speedup is that the fine-grained resource sharing and the elevated hardware utilization in SMT processors may expose more program runtime state to neutron or alpha particle strikes at any given time, resulting in an increased microarchitecture susceptibility to transient faults. To characterize and understand hardware vulnerability to soft errors on SMT architectures, we have developed a reliability-aware SMT processor simulator. The framework provides cycle-accurate, microarchitecture-level reliability estimation for a parameterizable SMT architecture. Using a mixed set of SPEC CPU 2000 benchmarks, we quantify microarchitecture reliability efficiency on SMT architectures with different numbers of thread contexts and compare it with that of superscalar execution. Additionally, we examine the impact of various SMT fetch policies on microarchitecture reliability and further contrast reliability and performance tradeoffs across different workloads. The major observations and conclusions that can be drawn from our study are:
- For SMT architectures, shared microarchitecture structures are more susceptible to soft error strikes due to their higher resource utilization.
- Compared to superscalar execution, SMT can still achieve better performance and reliability tradeoffs.
- In spite of the increased overall microarchitecture vulnerability, SMT execution is likely to reduce the microarchitecture vulnerability of individual threads relative to superscalar execution.
© 2007 IEEE

- Fetch policies play an important role in determining SMT microarchitecture reliability-domain characteristics and efficiency. Compared to a SMT performance metric such as throughput, the reliability metric is more sensitive to the fetch policy. For example, SMT architectures employing two different fetch policies can result in similar throughput while exhibiting significantly different reliability behavior. Among all fetch policies we examined in this study, FLUSH and DWarn exhibit the most attractive behavior in both reliability and its tradeoff with performance. However, when fairness is considered, their advantages diminish.
The rest of the paper is organized as follows. Section 2 provides a brief background on the microarchitecture-level soft error vulnerability analysis methods we used in this study. Section 3 describes our simulation framework, methodology, and experimental setup. Section 4 presents a detailed microarchitecture soft error vulnerability profile of a SMT processor. Section 5 discusses the experimental results and highlights thread-aware reliability optimization opportunities. Section 6 summarizes related work. Section 7 concludes the paper and outlines our future work.

2. Microarchitecture Soft Error Vulnerability Analysis
Several techniques have been proposed to model processor vulnerability to soft errors at the microarchitecture level. In [7], Li and Adve estimate reliability using a probabilistic model of the error generation and propagation process in a processor. Statistical fault injection [3, 17] has also been used to evaluate architectural reliability. In this work, we estimate the reliability of SMT architectures using the Architectural Vulnerability Factor (AVF) analysis methods introduced in [4, 5]. In this section, we briefly describe the AVF computation methods to provide sufficient background for the rest of the paper.
A hardware structure's AVF refers to the probability that a transient fault in that hardware structure will result in incorrect program results. The overall soft error rate of a hardware structure is determined by two factors: the device raw error rate, mainly determined by circuit design and process technology, and the AVF. Since the hardware raw error rate does not change with code execution, the AVF can be used as a reliability metric to estimate how vulnerable the hardware is to soft errors during program execution. To compute the AVF, one needs to classify hardware states into bits that store information that can affect the final program output and those that cannot. The processor state bits required for architecturally correct execution (ACE) are called ACE bits [4]. In a given cycle, the AVF of a hardware structure is the percentage of ACE bits that the structure holds. The AVF of a hardware structure over a program run is derived by averaging the per-cycle AVFs across program execution. In practice, it is much more feasible to identify un-ACE bits, the processor state bits that do not affect correct program execution. Examples of locations that contain un-ACE bits include NOPs, idle or invalid states, uncommitted instructions, dynamically dead instructions and data, and cache lines that will not be accessed before eviction. A cycle-accurate, execution-driven performance model can be used to identify un-ACE bits and to track the residency cycles of un-ACE bits in hardware structures. Through cycle-level simulation, processor microarchitecture states are classified into ACE/un-ACE bits and their residency and resource usage counts are generated. This information is then used to compute the AVF of various hardware structures. To estimate the reliability of the entire processor, one can add the AVF values of all of the hardware structures together, weighting them by the number of bits within each structure.
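To make the bookkeeping concrete, the per-cycle AVF averaging described above can be sketched as follows. This is a toy illustration, not the paper's actual framework: the `CycleSample`/`structure_avf` names and the numbers are invented.

```python
# A minimal sketch of per-structure AVF accounting from a cycle-level trace.
# Illustrative only: names and toy numbers are assumptions, not the paper's
# simulator internals.
from dataclasses import dataclass

@dataclass
class CycleSample:
    ace_bits: int    # bits classified as ACE in this cycle
    total_bits: int  # total bits in the structure

def structure_avf(trace):
    """AVF = average over all cycles of (ACE bits / total bits)."""
    if not trace:
        return 0.0
    return sum(s.ace_bits / s.total_bits for s in trace) / len(trace)

# Toy trace for a 96-entry, 64-bit-per-entry issue queue over 4 cycles:
trace = [CycleSample(a, 96 * 64) for a in (1024, 2048, 512, 512)]
print(round(structure_avf(trace), 4))  # 0.1667
```

Whole-processor AVF would then be a bit-count-weighted average of such per-structure values, as the text notes.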
In this work, we focus on computing the reliability of individual microarchitecture structures, since we believe that a component-based vulnerability analysis provides readers better knowledge and insight on the impact of SMT techniques on processor reliability.

3. Experimental Setup
We have developed a framework that estimates the architectural and microarchitectural effects of soft errors on SMT architectures. Our reliability-aware SMT simulation framework is built on a heavily modified and extended M-Sim simulator [18], which models a detailed, execution-driven simultaneous multithreading processor with shared architecture components, such as the instruction queue, physical register file pool, function units and caches, as well as private structures for individual threads, including separate rename tables, program counters, reorder buffers, branch predictors and load/store queues. To model reliability, we implement and extend the AVF computation methods proposed in [4, 5] to support SMT architectures. As a result, our framework is capable of tracking the microarchitecture vulnerability contributed by each individual thread and summarizing the aggregated microarchitecture vulnerability due to multithreading. Our SMT reliability analysis framework covers a wide range of shared and non-shared microarchitecture components, including the instruction queue, register file, function units, reorder buffer, L1 data cache, TLB and load/store queue. At the end of each simulation, the framework outputs reliability estimates along with performance data. Table 1 shows the baseline machine configuration we used in this study. In a SMT processor, instruction fetch policies largely decide how processor resources are shared among threads and therefore play an important role in determining the overall performance of a SMT processor. We use ICOUNT [19] as the baseline fetch policy. The ICOUNT policy assigns the highest priority to the thread that has the fewest in-flight instructions.
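The ICOUNT priority computation amounts to a sort over per-thread in-flight instruction counts. A minimal sketch (the thread ids and counts are hypothetical, not taken from the simulator):

```python
# Sketch of ICOUNT fetch prioritization: the thread with the fewest
# instructions in the pre-issue pipeline stages fetches first.
# The dictionary of per-thread counts is an illustrative stand-in for the
# counters a real SMT front end would maintain.

def icount_priority(inflight):
    """Return thread ids ordered from highest to lowest fetch priority."""
    return sorted(inflight, key=inflight.get)

print(icount_priority({0: 34, 1: 12, 2: 5, 3: 20}))  # [2, 1, 3, 0]
```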
In Section 4.3, we further examine a set of advanced fetch policies such as STALL [20], DG [21], PDG [21], DWarn [22] and
FLUSH [20] to investigate the impact of fetch policies on SMT microarchitecture reliability.

Table 1. Simulated Machine Configuration
  Processor Width: 8-wide fetch/issue/commit
  Baseline Fetch Policy: ICOUNT
  Pipeline Depth: 7
  Issue Queue: 96 entries
  ITLB: 128 entries, 4-way, 200-cycle miss latency
  Branch Prediction: 2K-entry Gshare, 10-bit global history per thread
  BTB: 2K entries, 4-way, per thread
  Return Address Stack: 32 entries
  L1 Instruction Cache: 32KB, 2-way, 32 bytes/line, 2 ports, 1-cycle access
  ROB Size: 96 entries per thread
  Load/Store Queue: 48 entries per thread
  Integer ALU: 8 I-ALU, 4 I-MUL/DIV, 4 Load/Store
  FP ALU: 8 FP-ALU, 4 FP-MUL/DIV/SQRT
  DTLB: 256 entries, 4-way, 200-cycle miss latency
  L1 Data Cache: 64KB, 4-way, 64 bytes/line, 2 ports, 1-cycle access
  L2 Cache: unified, 2MB, 4-way, 128 bytes/line, 12-cycle access
  Memory Access: 64-bit wide, 200-cycle access latency

The SMT workloads in our experiments are comprised of SPEC CPU 2000 integer and floating point benchmarks. Because the characteristics of SMT workloads vary significantly and strongly depend on both the number of threads and the individual thread behavior, we create a set of SMT workloads with the number of threads ranging from 2 to 8 contexts and with the individual thread characteristics ranging from computation intensive to memory access intensive (see Table 2). We first categorize a SPEC benchmark as CPU intensive (CPU) or memory intensive (MEM) based on its IPC and cache miss rate after performing a simulation of 100M instructions from the selected execution point. The CPU and MEM workloads consist of programs drawn entirely from the CPU-intensive and memory-intensive groups respectively. Half of the programs in a SMT workload with mixed behavior (MIX) are selected from the CPU-intensive group and the rest are selected from the memory-intensive group. To ensure that our experimental results are not biased by a specific set of threads, we build two groups for each type of SMT workload and report average statistics wherever possible.
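The CPU/MEM categorization step can be sketched as below. The paper reports classifying by IPC and cache miss rate but not the exact cutoffs, so the thresholds here are invented placeholders:

```python
# Hypothetical classification of a benchmark as CPU- or memory-intensive
# from its profiled IPC and L2 miss rate; the threshold values are invented.

def classify(ipc, l2_miss_rate, ipc_cut=1.5, miss_cut=0.05):
    # Low IPC combined with a high miss rate marks a memory-bound thread.
    return "MEM" if ipc < ipc_cut and l2_miss_rate > miss_cut else "CPU"

print(classify(ipc=2.1, l2_miss_rate=0.01))  # CPU
print(classify(ipc=0.6, l2_miss_rate=0.12))  # MEM
```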
The only exception is the 8-context workloads, due to the insufficient number of programs that can be used to form two workload groups with enough diversity. We use the SimPoint tool to pick the most representative simulation point for each benchmark, and each benchmark is fast-forwarded to its representative point before detailed multithreaded simulation takes place. The simulations are terminated once the total number of simulated instructions reaches 50 million, 100 million and 200 million in 2-, 4-, and 8-context workloads respectively. Due to the high cost of AVF computation and the large number of simulations we perform in this study, we are unable to simulate a larger number of instructions.

Table 2. The Studied SMT Workloads
  2-Thread CPU: Group A: bzip2, eon; Group B: facerec, wupwise
  2-Thread MIX: Group A: eon, twolf; Group B: wupwise, equake
  2-Thread MEM: Group A: mcf, twolf; Group B: equake, vpr
  4-Thread CPU: Group A: bzip2, eon, gcc, perlbmk; Group B: perlbmk, mesa, facerec, wupwise
  4-Thread MIX: Group A: gcc, mcf, vpr, perlbmk; Group B: perlbmk, mesa, twolf, applu
  4-Thread MEM: Group A: mcf, equake, vpr, swim; Group B: twolf, galgel, applu, lucas
  8-Thread CPU: Group A: gap, bzip2, facerec, eon, mesa, perlbmk, parser, wupwise; Group B: gap, crafty, gcc, eon, mesa, perlbmk, fma3d, wupwise
  8-Thread MIX: Group A: perlbmk, mcf, mesa, swim, crafty, fma3d, equake, mgrid; Group B: bzip2, vpr, eon, lucas, applu, twolf, wupwise, perlbmk
  8-Thread MEM: mcf, twolf, swim, lucas, equake, applu, vpr, mgrid

We use several metrics to quantify the reliability-domain behavior of SMT architectures. The AVF is used as a baseline metric to estimate how susceptible a microarchitecture structure is to soft error strikes. However, raw AVF values are likely to be misleading, as they can be deflated by stretched execution cycles. In this study, we further use the Mean Instructions To Failure (MITF) metric to reason about the tradeoff between performance and reliability [6]. MITF represents the amount of work a processor can complete, on average, between two failures. A higher MITF implies a greater amount of work accomplished between errors, which is desirable.
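At a fixed frequency and raw error rate, MITF scales with IPC/AVF, so the tradeoff can be evaluated with a one-line helper. All numbers below are illustrative, not measured results from the paper:

```python
# Toy comparison of reliability efficiency (IPC/AVF, proportional to MITF
# at fixed frequency and raw error rate). The IPC and AVF values are
# invented for illustration.

def reliability_efficiency(ipc, avf):
    """Higher IPC/AVF = more work completed between soft errors."""
    return ipc / avf

smt = reliability_efficiency(ipc=3.2, avf=0.20)     # SMT: higher IPC and AVF
scalar = reliability_efficiency(ipc=1.4, avf=0.12)  # superscalar baseline
print(smt > scalar)  # True in this toy case: SMT wins the tradeoff
```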
At a fixed frequency and raw error rate, MITF is proportional to the ratio of IPC to AVF (IPC/AVF). Results throughout this paper are reported using both metrics.

4. Microarchitecture Soft Error Vulnerability of SMT Processors
This section presents a detailed study of microarchitecture vulnerability to soft errors on SMT processors. We start by profiling the reliability-domain characteristics across different microarchitecture structures in a SMT processor. We then contrast superscalar and SMT execution in terms of their reliability characteristics. Finally, we analyze how different fetch policies affect SMT architecture reliability-domain behavior.

4.1 SMT Microarchitecture Vulnerability Profile
We first set up a set of experiments to answer the following questions. (1) What is the microarchitecture reliability profile of the studied SMT processor? (2) Which SMT microarchitecture structures are more susceptible to soft error strikes? (3) Is SMT architecture beneficial to the reliability of individual threads as well as the whole multithreaded workload? (4) Do the above SMT microarchitecture reliability profiles change with different workload mixes?

Figure 1 shows the microarchitecture vulnerability profile of the studied SMT processor running 4-context multithreaded workloads. The microarchitecture structures are grouped together as shared pipeline structures, shared memory structures and non-shared structures. In general, the shared structures exhibit a higher AVF than the non-shared structures. Among the shared pipeline components, the instruction queue (IQ) and register files show higher vulnerability. This is because exploiting thread-level parallelism results in improved resource utilization on these structures. For example, by taking advantage of available idle entries in the IQ, more ACE bits are brought into the instruction queue waiting for execution, increasing the amount of time they are exposed to soft error strikes and thus making the IQ more susceptible to soft errors. A similar scenario occurs with the register files: instead of sitting idle, more physical registers in the pool are allocated. Surprisingly, among the shared memory structures, we find that the DL1 tag exhibits a higher vulnerability than the DL1 data array. Although intuitively the data array contains bits which are more critical to program execution, only a portion of the cache block is read and written upon a given memory reference. Soft error strikes on bits that are not accessed by the processor will not affect the program results. Unlike the data array, however, all of the tag bits are used to check for a match.

Figure 1. Microarchitecture Vulnerability Profile of the Studied SMT Processor (# of contexts = 4)

Compared to the CPU-bound workloads, the memory-bound workloads yield a higher AVF on microarchitecture structures that are used to extract instruction level parallelism (e.g. the IQ, register file, ROB and LSQ).
For example, memory-bound workloads increase the AVF of the IQ, register file, ROB and LSQ by 58%, 61%, 82% and 94% respectively. This is because the memory-bound workloads increase the cache miss rate. Upon a cache miss, the instruction experiencing the miss, as well as all instructions along its dependency chain, cannot be processed in the pipeline until the miss is served. The ACE bits contributed by these instructions will stall in the above structures for a long period of time, especially when an L2 cache miss occurs. The increased ACE bit residency increases the AVF of these structures significantly. On the other hand, the AVFs of the function unit (FU) and the DL1 data array are reduced in MEM workloads. The reduction in the function unit is due to the increased fraction of idle cycles caused by the diminished ILP in memory-bound workloads and stretched execution times. As for the DL1 data array, the memory-bound workloads generate more evictions and refills, which convert ACE cycles to un-ACE cycles. More specifically, in CPU workloads, accesses to the same cache block might not be clustered; there might be a long time between two cache hits on the same block. During this period, a soft error strike on a portion of the data within that block which will be accessed later can lead to an error. On MEM workloads, the increased number of cache misses can cause blocks to be evicted and then refilled more frequently. The increased cache line competition reduces the residency cycles during which a cache block is present, which in turn reduces the AVF of memory-related structures on MEM workloads.

Figure 2. Microarchitecture Reliability Efficiency (Measured as IPC/AVF) (# of contexts = 4)

Figure 2 shows microarchitecture reliability efficiency, as measured using the IPC/AVF metric, across different workload mixes.
As can be seen, the SMT microarchitecture yields the highest reliability efficiency on CPU-bound workloads, which means a greater amount of work can be completed between errors. This suggests that despite the higher resource utilization, which can increase the quantity of ACE bits stored within microarchitecture structures, the residency cycle reduction for ACE bits on high-ILP workloads results in an overall improvement in reliability efficiency. Having profiled the baseline SMT microarchitecture reliability characteristics, we set up a set of experiments to contrast the reliability of single thread execution with that of SMT execution. Our objective was to answer the following question: is SMT execution beneficial to the reliability of individual threads? In single thread execution, a thread can occupy all pipeline resources and therefore take advantage of exclusive access to all of the processor's resources to exploit maximum ILP. In SMT execution, a thread only uses a portion of all available resources and thus progresses more slowly. The number of ACE bits that each thread exposes to soft error strikes is correlated to the scale of its allocated resources, while the ACE bit residency time depends on the thread's execution speed. Since the overall AVF is governed by both of these factors, increasing with the quantity of ACE bits and decreasing as the residency time is decreased, there is no simple answer to this question. Figure 3 breaks down the IQ, ROB and FU AVFs of each individual thread in the SMT execution (4 contexts) and compares them to those of single thread execution. To allow for
direct comparison of results, we record the progress of each thread in the SMT execution and then simulate the same number of instructions for each thread in the single thread execution mode. This ensures that the work completed in sequential single thread execution is identical to that completed in concurrent SMT execution. Figure 3 shows that for individual threads, single thread execution yields higher microarchitecture vulnerability than the vulnerability contributed by the same thread in the SMT execution mode. For example, the IQ and ROB AVFs contributed by gcc drop by 91% and 87% respectively when it is paired with mcf, vpr, and perlbmk in SMT execution. This indicates that for individual threads, the increase in AVF due to fully occupying pipeline resources outweighs the decrease in AVF due to the acceleration in processing ACE bits.

Figure 3. Microarchitecture Vulnerability: SMT vs. Single Thread (ST) Execution

The aggregated AVF contributed by all of the threads in SMT execution exceeds the weighted AVF yielded by the microarchitecture when all of the threads are executed sequentially. The weighted AVF in sequential execution is derived using an individual thread's AVF weighted by the fraction of work that each thread completes. It is not surprising to observe a 2X increase in the IQ AVF on a 4-context SMT processor, where four CPU workload threads execute simultaneously and much more pressure is exerted on the IQ. Function units experience a similar trend, as exploiting both ILP and TLP can keep function units busy more frequently. The ROB AVF reduction in SMT execution is somewhat counterintuitive.
During SMT execution, instruction dispatching from the ROB to the IQ is more likely to be stalled due to competition for dispatch bandwidth and available IQ entries among different threads, resulting in additional dispatch delay which in turn increases the residency time of ACE bits in the ROB. However, there are additional factors that can affect the ROB AVF. A closer investigation indicates that the size of the register file limits ROB utilization during SMT execution. Since the register file is a shared resource and its size is fixed, the number of available registers that can be used by each thread for renaming diminishes with the number of threads. This in turn restricts the number of ROB entries allocated to each thread. This differs from single threaded execution, where a thread has exclusive access to all registers, allowing the allocation of more ROB entries. For our experiments, the 0.83 reduction in the registers available to each thread significantly reduces ROB utilization and AVF.

Figure 4. Microarchitecture Reliability Efficiency: SMT vs. Single Thread (ST) Execution

As stated in Section 3, the AVF can be used to reflect how vulnerable a hardware structure is to soft errors. However, raw AVF values can be misleading when used to compare the reliability of different design alternatives. Reliability efficiency offers insight into the tradeoffs between performance and reliability. Figure 4 shows the reliability efficiency of SMT execution and single thread execution. Several interesting observations can be made from our experimental results. First, the reliability efficiency of the FU for all threads is quite similar. This can be explained as follows: AVF is computed as the number of ACE bit residency cycles divided by the total number of execution cycles.
IPC includes the total number of execution cycles in its denominator as well. Therefore, the IPC/AVF metric cancels out the number of execution cycles and leaves only information on the amount of work that is completed. Because the same number of instructions is processed in SMT and stand-alone execution, IPC/AVF is the same for the FU. Second, we find that on CPU workloads, stand-alone execution generally yields better reliability efficiency on the IQ, whereas on MEM workloads, SMT execution can be beneficial to the IQ reliability efficiency of individual threads. However, this trend becomes unpredictable on MIX workloads. Essentially, threads in CPU workloads benefit from exclusive access to all resources in single threaded execution, although this increases the number of ACE bits brought into the IQ. This increase in AVF is offset by a relatively larger increase in performance, and thus the improved performance dominates the IPC/AVF results. While the overall throughput is increased during SMT execution, both the performance of each individual thread and the quantity of ACE bits contributed by individual threads are decreased, and thus both the IPC and the AVF are scaled down proportionally. On the MEM workloads, threads show high vulnerability during stand-alone execution and yet are unable to achieve much of an improvement in throughput in SMT execution due to the high cache miss rates across all of the threads. Reliability efficiency is therefore primarily determined by the increase in AVF. Overall, the IQ is more burdened in a SMT architecture than in a superscalar architecture, leading to a higher vulnerability. Comparing the overall AVF of multithreaded execution versus the aggregated AVF of
superscalar execution, we find that although the IPC/AVF metric is biased towards single threaded execution for some individual thread/structure cases, when considering the overall reliability efficiency of workloads, SMT architecture outperforms superscalar for all of the cases except the IQ on CPU workloads. This exception is due to the relatively large increase in AVF as compared to that of performance.

4.2 Microarchitecture Reliability vs. Number of Contexts
Figure 5 illustrates the SMT microarchitecture AVF trend as the number of thread contexts is increased. Shared structures such as the IQ show a steady increase in AVF as more threads are added. The AVF of the register file increases rapidly from 2-context workloads to 4-context workloads. A limited increase is observed when the number of contexts is further increased from 4 to 8, especially on MIX and MEM workloads. This is because as the number of contexts is increased, or as more threads become memory bound, an increased number of instructions can experience cache misses due to resource contention and poor cache behavior. These instructions, especially those experiencing L2 cache misses, can hold allocated registers for a long period of time. By further investigating the register life cycle, we found that registers remain in an allocated state without holding valid data until the write-back stage; soft error strikes during this period of time have no effect on the register values, since the registers will be overwritten during the write-back stage. So the time period from the renaming stage to the write-back stage is treated as an un-ACE period, meaning that it is not vulnerable. A long un-ACE period will reduce the fraction of time that a register holds valid data, leading to reduced vulnerability. Interestingly, we find that on MEM workloads, the DL1 data array AVF decreases as the number of contexts increases. This is because handling additional DL1 misses increases the frequency of evictions and updates in DL1 operations.
The ACE bit life cycle is therefore reduced. On the other hand, compared with MEM workloads, MIX workloads are able to deliver higher throughput and finish execution in a shorter period of time. Because of the reduced execution time, the percentage of ACE bits within the structure is larger at any given time in the MIX workloads, and this is reflected by a higher AVF in the MIX workloads than in the MEM workloads. For the DL1 data array, the increased AVF is caused by the increased number of threads, memory reference frequency and footprint. The FU AVF shows heterogeneous behavior when adding more thread contexts. On CPU workloads, increasing the number of thread contexts from 2 to 4 improves function unit utilization and AVF. However, when the number of contexts reaches 8, threads aggressively compete for resources, resulting in more contention and hence extended total execution time. By definition, AVF is determined by ACE bit residency time over total execution cycles; when the whole execution is stretched, the AVF is reduced. This observation holds valid on the MIX workloads. On MEM workloads, the limited ILP exploited across multiple threads does not reach the upper-bound processing capability of the pipeline; therefore, the FU AVF continuously increases as more threads are added.

Figure 5. Microarchitecture Vulnerability vs. Number of Contexts

4.3 Impact of Fetch Policy on SMT Microarchitecture Reliability
In the performance domain, the overall throughput that a SMT processor can deliver highly depends on the fetch scheme, which controls the quantity of instructions that are brought into the pipeline from multiple threads. From a reliability perspective, the fetch policy affects the quantity of ACE bits that can co-exist in a SMT processor pipeline as well as the speed at which the pipeline can process these ACE bits. In this section, we quantify the impact of various fetch policies on SMT microarchitecture reliability.
There has been abundant work in the literature on optimizing front-end SMT fetch mechanisms. From among those, we selected five fetch policies (i.e., STALL, DG, PDG, DWarn and FLUSH) to use in our study. These five fetch policies differ from each other in how they react to long-latency instructions. Long-latency instructions can significantly increase the residency cycles within microarchitecture structures and thereby increase the AVF. FLUSH frees up pipeline resources for threads suffering L2 cache misses by squashing the dispatched instructions from the offending threads. There are several alternative schemes to determine when to flush the pipeline; we implement a policy that flushes the pipeline from the first instruction following a cache miss instruction. STALL prevents instruction fetching for threads that have L2 cache misses but always allows at least one thread to continue fetching instructions. DG and PDG stop fetching once a thread has several outstanding L1 cache miss instructions. The difference between DG and PDG is that PDG predicts L1 cache misses to minimize the delay of decision making. Instead of stopping instruction fetch, the DWarn policy assigns lower fetch priority to threads with outstanding data cache misses. Figure 6 shows the AVF of the SMT microarchitecture running 4-context and 8-context workloads under different fetch policies. Among all of the fetch policies we examined, FLUSH shows the most distinct behavior. The AVF of both shared and non-shared microarchitecture structures (e.g. the IQ, ROB and LSQ) reduces significantly when the FLUSH policy is applied. For example, the

7 IQ, ROB and LSQ under the fetch policy are only about 5 of the under other fetch policies. The reason for this is, upon a L2 cache miss occurrence, the ACE bits contributed by instructions which have a dependence on the cache miss will remain in these structures for at least hundreds of cycles. Even other instructions independent of them will experience a long delay because of the blocked commitment. By flushing these instructions from the offending thread, the ACE bit residency cycles and the average exposure time of ACE bits to soft error strikes is reduced significantly. Interestingly, we find that the fetch policy can increase the of the function unit and data cache. This is because additional resources freed by can speed up the execution of non-offending threads, which in turn leads to increased function unit utilization and frequent accesses to cache blocks, resulting in more bits within the cache that are vulnerable. The impact of the policy on SMT microarchitecture varies with the number and characteristics of threads. On 8 context workloads where the SMT processor yields more cache misses due to resource contention, shows a more distinct behavior than other fetch policies. Similarly, on memorybounded workloads which have larger memory footprints, reduces more noticeably. Interestingly, we observe that on a 4-context CPU workload, the policy results in an increase of. As we discussed earlier, the reduced number of execution cycles increases on the 4-context CPU workload. 5 45% 35% 3 25% 15% 1 5% P P P P P P CPU P P IQ ROB FU Reg LSQ_data LSQ_tag DL1_data DL1-tag Fetch policies such as,, P and DWarn respond to cache misses by either blocking instruction fetch from the offending threads or assigning a lower priority to those threads. Compared with the baseline, they can alleviate microarchitecture vulnerability in many cases. Nevertheless, they still exhibit higher than. 
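To make the policy distinctions concrete, the toy dispatcher below sketches how each policy reacts to one thread's miss state. The policy set matches the ones studied here, but the decision logic and the two-outstanding-L1-miss threshold for DG/PDG are illustrative assumptions, not the authors' implementation:

```python
# Simplified per-thread fetch decision for each policy (a sketch only).
def fetch_decision(policy, t):
    """t: per-thread miss state. Returns 'fetch', 'stall',
    'deprioritize', or 'flush'."""
    if policy == "ICOUNT":
        return "fetch"                    # baseline: ignores cache misses
    if policy == "FLUSH" and t["l2_miss"]:
        return "flush"                    # squash the offending thread
    if policy == "STALL" and t["l2_miss"]:
        return "stall"                    # stop fetching for this thread
    if policy == "DG" and t["l1_misses"] >= 2:
        return "stall"                    # gate on outstanding L1 misses
    if policy == "PDG" and (t["l1_misses"] >= 2 or t["l1_miss_predicted"]):
        return "stall"                    # gate earlier, via prediction
    if policy == "DWarn" and (t["l1_misses"] > 0 or t["l2_miss"]):
        return "deprioritize"             # keep fetching, lower priority
    return "fetch"

t = {"l2_miss": True, "l1_misses": 2, "l1_miss_predicted": True}
for p in ["ICOUNT", "FLUSH", "STALL", "DG", "PDG", "DWarn"]:
    print(p, fetch_decision(p, t))
```

The key asymmetry the results hinge on is visible here: only FLUSH removes ACE bits already in the pipeline; the other four merely throttle what enters next.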
These schemes are incapable of handling instructions that have already been brought into the pipeline. Consequently, more ACE bits are brought into the pipeline and stay in microarchitecture structures for the next several hundred cycles, contributing to a high AVF. In Figure 6, we find that STALL barely affects the IQ AVF in the 4-context workload compared with the baseline, while it becomes quite effective in the 8-context workload. The reason is that resource contention in the 8-context workload results in a higher cache miss rate. Comparing the AVF of STALL with that of DG and PDG, STALL outperforms the latter two. The difference arises primarily because STALL is more responsive to L2 cache misses, which have a more significant impact on AVF. By only monitoring outstanding L1 cache misses, DG and PDG are limited in their ability to respond to L2 cache misses, and hence the effectiveness of their AVF reduction is restricted. Another interesting finding is that the advanced fetch policies usually achieve a lower AVF than the ICOUNT scheme on the MEM and MIX workloads. Conversely, on the 4-context CPU workloads, the advanced fetch policies can exhibit higher AVF.

(a) 4 Contexts (b) 8 Contexts
Figure 6. Microarchitecture AVF under Different Fetch Policies

Accounting for both reliability and performance, we present the IPC/AVF metric in Figure 7 for the five advanced fetch policies. All data shown in Figure 7 are normalized to the ICOUNT baseline results. A higher IPC/AVF means that a greater amount of work can be completed between error occurrences and therefore suggests a better reliability/performance tradeoff. Mechanisms that reduce both the AVF and the IPC may be worthwhile only if they increase the MITF. As shown in Figure 7, FLUSH does the best overall concerning both reliability and performance. STALL also shows the potential to achieve a desirable reliability-performance tradeoff. The reliability efficiency of the FLUSH policy reaches its maximum value on the MEM workloads.
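The remark about MITF can be made precise: for a structure with a fixed per-bit raw error rate, the mean instructions to failure is proportional to IPC/AVF, so a policy that costs some IPC can still improve MITF if it cuts AVF enough. A sketch with illustrative numbers (not data from this study):

```python
# Mean instructions to failure: committed instructions per soft error.
# MITF = (IPC * cycles) / (rate * bits * AVF * cycles)
#      = IPC / (rate * bits * AVF), hence proportional to IPC/AVF.
def mitf(ipc, avf, bits, raw_err_per_bit_cycle):
    return ipc / (avf * bits * raw_err_per_bit_cycle)

base = mitf(ipc=1.0, avf=0.20, bits=2048, raw_err_per_bit_cycle=1e-15)
new = mitf(ipc=0.9, avf=0.10, bits=2048, raw_err_per_bit_cycle=1e-15)
print(new / base)  # ≈ 1.8: losing 10% IPC while halving AVF still raises MITF
```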
The reason is that the ICOUNT baseline is worse than FLUSH in both throughput and reliability due to the memory-bound threads: they penalize performance by clogging the IQ and increase AVF by bringing in long-latency ACE bits. Contrary to ICOUNT, the FLUSH policy handles memory-bound threads better by solving both problems, and it achieves an improvement in both throughput and reliability. However, FLUSH fails to outperform ICOUNT on the CPU workload due to the small number of cache misses present within that workload; throughput and vulnerability can only benefit slightly from FLUSH. On the MIX workload, AVF is reduced substantially along with a slight performance improvement. On CPU workloads, due to the low cache miss frequency, the difference between an advanced fetch policy and the ICOUNT baseline scheme diminishes. As the number of thread contexts increases from 4 to 8, more cache misses occur on CPU-bound workloads; as a result, the FLUSH policy starts to manifest better reliability efficiency. When all threads become memory bound, no thread can make progress under the FLUSH policy, and therefore the reliability efficiency drops again. Figure 7 shows that, overall, FLUSH and STALL show better reliability efficiency than the other examined fetch policies.

Although throughput represents the overall performance, it does not reflect the progress made by each individual thread, and hence fairness across all threads is not considered. To address this, we use weighted speedup and harmonic IPC as proposed in [14, 15]. Figure 8 shows the performance and reliability tradeoff evaluation using these alternative performance metrics. The weighted speedup is obtained by normalizing each thread's performance in SMT to that in single-thread execution and adding the results together; the resulting value reflects the effective throughput of the workload compared to single-thread execution. The harmonic IPC calculates the harmonic mean of the weighted per-thread IPCs to quantify both performance and fairness. There are several interesting observations when different performance/AVF metrics are used to quantify the reliability efficiency. As shown in Figure 7, when IPC/AVF is used, FLUSH always outperforms the other fetch schemes across all microarchitecture structures. When tradeoffs are measured with weighted speedup, as shown in Figure 8(a), the difference between FLUSH and the others diminishes. When harmonic IPC is used, DWarn becomes a promising mechanism to achieve a good performance and reliability tradeoff on microarchitecture components such as the FU, DL1, and register file.
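The two fairness-aware metrics from [14, 15] are easy to state precisely. Below is a small sketch; the per-thread IPC values are illustrative, not results from this study:

```python
def weighted_speedup(smt_ipc, alone_ipc):
    """Sum of each thread's SMT IPC relative to its single-thread IPC."""
    return sum(s / a for s, a in zip(smt_ipc, alone_ipc))

def harmonic_ipc(smt_ipc, alone_ipc):
    """Harmonic mean of the relative IPCs; a starved thread drags it down."""
    n = len(smt_ipc)
    return n / sum(a / s for s, a in zip(smt_ipc, alone_ipc))

smt = [1.2, 0.3]    # per-thread IPC when co-scheduled
alone = [2.0, 1.0]  # per-thread IPC when each runs alone
print(weighted_speedup(smt, alone))  # ≈ 0.9
print(harmonic_ipc(smt, alone))      # ≈ 0.4: the starved thread dominates
```

Because the harmonic mean is dominated by the slowest relative thread, a policy like FLUSH that starves offending threads scores worse under harmonic IPC than under raw throughput, matching the observations above.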
The reason is that, in contrast to DWarn, the FLUSH policy does not yield a significant reliability benefit on the FU, DL1, DTLB and register file. Moreover, FLUSH favors the execution progress of threads with low L2 cache miss rates and thus can penalize the weighted speedup and harmonic mean IPC. As a result, DWarn is the best choice for balancing reliability, throughput and fairness on those hardware structures. On the other hand, FLUSH remains the best choice for the IQ, ROB and LSQ even when harmonic IPC is used. This is due to the significant AVF reduction on these structures when the FLUSH policy is applied. Although the average harmonic IPC yielded by FLUSH on the 4-context and 8-context workloads is 16% lower than that yielded by DWarn, the AVF reduction achieved by FLUSH is considerably larger on average. On these structures, the benefit due to AVF reduction outweighs the loss in fairness due to unbalanced resource allocation.

Figure 7. A Comparison of Reliability Efficiency of Different Fetch Policies

Figure 8(a) and (b). A Comparison of Reliability Efficiency using Different Performance Metrics

5. Discussion of Experiment Results

The extensive simulation results we present in Section 4 reveal the challenges and opportunities in optimizing reliability for SMT microarchitectures. For example, our study shows that the highest vulnerability is likely to occur in the shared structures. To avoid vulnerability hotspots in their designs, architects need to first focus on protecting shared SMT microarchitecture structures from soft error strikes. Among the shared microarchitecture components, the IQ and register file stand out with respect to vulnerability, and the IQ particularly stands out in scenarios with many active threads. For single-thread execution, several architectural-level techniques [6] have been proposed to alleviate potential soft error strikes on vulnerable program and microarchitecture states. In [6], long-latency instructions are flushed to maintain a low IQ AVF. Our experimental results show that although FLUSH can effectively reduce the vulnerability of an offending thread on an SMT architecture, it does not ensure that the newly freed resources are allocated to other threads in a manner that is always favorable to reliability. Our study indicates that, in addition to coping with cache misses, limiting the scale of resources that can be shared by multiple threads can also contribute to AVF reduction. In SMT processors, ILP as well as TLP is limited by program properties and available threads. By increasing the size of microarchitecture structures, architects aim to exploit more parallelism. Nevertheless, the performance gain does not scale linearly with the amount of hardware resources. This effect, on the other hand, has a great influence on reliability, because the increased size of a microarchitecture structure is likely to bring in more in-flight instructions and expose more program state to soft-error strikes. Reliability-aware resource allocation avoids resource abuse by threads with a high fraction of ACE bits within the pipeline.
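One way to sketch such reliability-aware allocation at the fetch stage is to demote threads whose estimated in-flight ACE-bit count exceeds a budget, falling back to ICOUNT ordering otherwise. Everything below (the budget value, the ThreadState fields) is a hypothetical illustration, not the authors' design:

```python
from dataclasses import dataclass

@dataclass
class ThreadState:
    tid: int
    ace_bits_in_flight: int  # estimated ACE bits this thread holds in the pipeline
    icount: int              # in-flight instruction count (the ICOUNT input)

def fetch_priority(threads, ace_budget=4096):
    """Order threads for fetch: ICOUNT ordering, but threads over the
    ACE-bit budget are demoted, bounding the AVF they can contribute."""
    under = [t for t in threads if t.ace_bits_in_flight <= ace_budget]
    over = [t for t in threads if t.ace_bits_in_flight > ace_budget]
    by_icount = lambda t: t.icount  # fewest in-flight instructions first
    return sorted(under, key=by_icount) + sorted(over, key=by_icount)

threads = [ThreadState(0, 6000, 10), ThreadState(1, 1000, 30), ThreadState(2, 800, 5)]
print([t.tid for t in fetch_priority(threads)])  # [2, 1, 0]: thread 0 is demoted
```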
For example, a thread's instructions can form a long dependence chain and clog the IQ. In this case, reliability-aware resource allocation, such as enforcing predefined static IQ partitions for each thread, could help with vulnerability reduction. As a second example, reliability-aware fetch throttling, built on top of existing fetch schemes and extended with reliability awareness of individual threads, can be used to maintain a low AVF while achieving high throughput. By comparing results that measure IPC/AVF, weighted speedup/AVF and harmonic IPC/AVF, we find that the reliability efficiency yielded by FLUSH diminishes when fairness is considered. STALL exhibits the potential to reduce vulnerability while maintaining fairness. However, our analysis indicates that a limitation of STALL is that it stalls fetching for the offending thread only upon detecting an L2 cache miss; this delay can introduce a significant number of ACE bits into the pipeline. If the L2 cache miss could be predicted when the offending instruction enters the pipeline, fetch could be stalled immediately to ensure that no further ACE bits are brought into the pipeline. By incorporating an L2 cache miss prediction mechanism, STALL can be further enhanced to reduce vulnerability while maintaining fairness.

6. Related Work

There is a growing amount of work aimed at characterizing soft error behavior at the microarchitecture level. In [2, 3], detailed processor RTL models were used to estimate microarchitecture reliability. The RTL models contain all of the detailed information about the microprocessors. However, the simulation slowdown of RTL models is too expensive for architecture studies, in which the tradeoffs between many hardware configurations need to be considered. Moreover, these models are generally not available during the architectural exploration phase of a microprocessor design. In [4, 5], Mukherjee et al. introduced the concept of architecturally correct execution (ACE) bits to compute the AVF of microarchitecture structures using a performance model. The vulnerability of the hardware structures (e.g. instruction queue, execution unit, TLB and caches) of an Itanium2-like IA64 processor was studied in [4, 5]. To reduce instruction queue AVF, Weaver et al. [6] proposed selectively squashing instructions when long delays are encountered; they examine cache-miss triggers and squashing actions to remove existing instructions from the instruction queue. In [7], Li and Adve developed SoftArch, an architecture-level tool for modeling and analyzing soft errors. The SoftArch framework estimates reliability using a probabilistic model of the error generation and propagation process in a processor. In [8], microarchitecture vulnerability phase behavior during program execution is observed and its predictability is studied. As a complementary approach to AVF computation, statistical fault injection has been used in several studies [2, 3, 4] to evaluate architectural reliability; to obtain statistical significance, a large number of experiments need to be performed on the investigated hardware component. To our knowledge, microarchitecture soft error vulnerability analysis has so far been exclusively focused on single-thread execution environments. In the past, there have been numerous studies on performance characterization and optimization for SMT architectures. Power and thermal issues in SMT architecture design have also been studied recently in [23]. In [24, 25], SMT architectures were used to perform redundant thread execution for transient fault detection and recovery. Nevertheless, the implications of simultaneous multithreading for hardware reliability and the reliability efficiency of SMT architectures remain unexplored.

7. Conclusions

The use of simultaneous multithreading techniques enhances overall system performance but also raises questions about susceptibility to soft error strikes.
In this paper, we provide an in-depth analysis of the impact of multithreading on processor vulnerability to transient faults. We extend an SMT performance simulator with microarchitecture vulnerability computation models. Using programs from the SPEC CPU 2000 benchmark suite, we analyze the microarchitecture vulnerability of representative program mixes. Our major conclusions from this work are: (1) In general, the use of SMT techniques increases the vulnerability of shared microarchitecture structures, and the degree of vulnerability increases with the number of running threads. Our experimental results indicate that the IQ and register file are more susceptible to soft errors than the other structures we studied. To avoid vulnerability hotspots in their designs, architects need to first focus on protecting those shared SMT microarchitecture structures from soft error strikes. (2) Comparing the reliability of SMT to superscalar processors, multithreaded architectures exhibit an increased overall microarchitecture vulnerability, although they are likely to reduce the microarchitecture vulnerability of individual threads. (3) The microarchitecture vulnerability is sensitive to instruction fetch policies. Among all the fetch policies we investigated, FLUSH and STALL are the most attractive schemes for reducing vulnerability while maintaining good reliability efficiency; however, the advantage diminishes when fairness is taken into consideration. By analyzing the experimental results, we have gained insight into the SMT microarchitecture vulnerability profile and how this profile changes with workload behavior, the number of threads and fetch policies. We point out several optimization opportunities to improve reliability, such as enhancing the fetch policy with predicted cache misses to proactively stop fetching, or dynamically distributing resources among threads based on their vulnerability profiles. We plan to explore thread-aware reliability optimization techniques for SMT architectures in our future work.
Acknowledgment

This research is partially supported by a Microsoft Research Trustworthy Computing award and by NASA award no. NCC.

References

[1] P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi, Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic, In Proceedings of the International Conference on Dependable Systems and Networks.
[2] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S. Mukherjee, Design and Evaluation of Hybrid Fault-Detection Systems, In Proceedings of the International Symposium on Computer Architecture.
[3] N. J. Wang, J. Quek, T. M. Rafacz, and S. J. Patel, Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline, In Proceedings of the International Conference on Dependable Systems and Networks.
[4] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor, In Proceedings of the International Symposium on Microarchitecture.
[5] A. Biswas, R. Cheveresan, J. Emer, S. S. Mukherjee, P. B. Racunas, and R. Rangan, Computing Architectural Vulnerability Factors for Address-Based Structures, In Proceedings of the International Symposium on Computer Architecture.
[6] C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor, In Proceedings of the International Symposium on Computer Architecture.
[7] X. D. Li, S. V. Adve, P. Bose, and J. A. Rivers, SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors, In Proceedings of the International Conference on Dependable Systems and Networks.
[8] X. Fu, J. Poe, T. Li, and J. Fortes, Characterizing Microarchitecture Soft Error Vulnerability Phase Behavior, In Proceedings of the International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2006.
[9] D. Tullsen, S. Eggers, and H. Levy, Simultaneous Multithreading: Maximizing On-Chip Parallelism, In Proceedings of the International Symposium on Computer Architecture.
[10] P. Kongetira, K. Aingaran, and K. Olukotun, Niagara: A 32-Way Multithreaded Sparc Processor, IEEE Micro, vol. 25, no. 2, Mar/Apr.
[11] D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton, Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal, 6(1), Feb.
[12] J. Seng, D. Tullsen, and G. Cai, Power-Sensitive Multithreaded Architecture, In Proceedings of the International Conference on Computer Design.
[13] J. Hasan, A. Jalote, T. N. Vijaykumar, and C. Brodley, Heat Stroke: Power-Density-Based Denial of Service in SMT, In Proceedings of the International Symposium on High-Performance Computer Architecture.
[14] K. Luo, J. Gummaraju, and M. Franklin, Balancing Throughput and Fairness in SMT Processors, In Proceedings of the International Symposium on Performance Analysis of Systems and Software.
[15] S. Raasch and S. Reinhardt, The Impact of Resource Partitioning on SMT Processors, In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques.
[16] F. Cazorla, E. Fernandez, A. Ramirez, and M. Valero, Dynamically Controlled Resource Allocation in SMT Processors, In Proceedings of the International Symposium on Microarchitecture.
[17] E. W. Czeck and D. Siewiorek, Effects of Transient Gate-level Faults on Program Behavior, In Proceedings of the International Symposium on Fault-Tolerant Computing.
[18] Joseph Sharkey, M-Sim: A Flexible, Multithreaded Architectural Simulation Environment, Technical Report CS-TR-05-DP01, Department of Computer Science, State University of New York at Binghamton.
[19] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, In Proceedings of the International Symposium on Computer Architecture.
[20] D. Tullsen and J. Brown, Handling Long-latency Loads in a Simultaneous Multithreading Processor, In Proceedings of the International Symposium on Microarchitecture.
[21] A. El-Moursy and D. H. Albonesi, Front-end Policies for Improved Issue Efficiency in SMT Processors, In Proceedings of the International Symposium on High Performance Computer Architecture.
[22] F. J. Cazorla, E. Fernandez, A. Ramirez, and M. Valero, DCache Warn: an I-Fetch Policy to Increase SMT Efficiency, In Proceedings of the International Parallel and Distributed Processing Symposium.
[23] J. Hasan et al., Heat Stroke: Power-Density-Based Denial of Service in SMT, In Proceedings of the International Symposium on High-Performance Computer Architecture.
[24] T. N. Vijaykumar, I. Pomeranz, and K. Cheng, Transient-Fault Recovery via Simultaneous Multithreading, In Proceedings of the International Symposium on Computer Architecture.
[25] S. Reinhardt and S. Mukherjee, Transient Fault Detection via Simultaneous Multithreading, In Proceedings of the International Symposium on Computer Architecture.


Adaptive Cache Memories for SMT Processors

Adaptive Cache Memories for SMT Processors Adaptive Cache Memories for SMT Processors Sonia Lopez, Oscar Garnica, David H. Albonesi, Steven Dropsho, Juan Lanchares and Jose I. Hidalgo Department of Computer Engineering, Rochester Institute of Technology,

More information

ECE404 Term Project Sentinel Thread

ECE404 Term Project Sentinel Thread ECE404 Term Project Sentinel Thread Alok Garg Department of Electrical and Computer Engineering, University of Rochester 1 Introduction Performance degrading events like branch mispredictions and cache

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor Boost Sequential Program Performance Using A Virtual Large Instruction Window on Chip Multicore Processor Liqiang He Inner Mongolia University Huhhot, Inner Mongolia 010021 P.R.China liqiang@imu.edu.cn

More information

One-Level Cache Memory Design for Scalable SMT Architectures

One-Level Cache Memory Design for Scalable SMT Architectures One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract

More information

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs International Journal of Computer Systems (ISSN: 2394-1065), Volume 04 Issue 04, April, 2017 Available at http://www.ijcsonline.com/ An Intelligent Fetching algorithm For Efficient Physical Register File

More information

PROBABILITY THAT A FAULT WILL CAUSE A DECLARED ERROR. THE FIRST

PROBABILITY THAT A FAULT WILL CAUSE A DECLARED ERROR. THE FIRST REDUCING THE SOFT-ERROR RATE OF A HIGH-PERFORMANCE MICROPROCESSOR UNLIKE TRADITIONAL APPROACHES, WHICH FOCUS ON DETECTING AND RECOVERING FROM FAULTS, THE TECHNIQUES INTRODUCED HERE REDUCE THE PROBABILITY

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

Reliability in the Shadow of Long-Stall Instructions

Reliability in the Shadow of Long-Stall Instructions Reliability in the Shadow of Long-Stall Instructions Vilas Sridharan David Kaeli ECE Department Northeastern University Boston, MA 2115 {vilas, kaeli}@ece.neu.edu Arijit Biswas FACT Group Intel Corporation

More information

Performance Oriented Prefetching Enhancements Using Commit Stalls

Performance Oriented Prefetching Enhancements Using Commit Stalls Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of

More information

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department of Computer Science State University of New York

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

High Performance Memory Requests Scheduling Technique for Multicore Processors

High Performance Memory Requests Scheduling Technique for Multicore Processors High Performance Memory Requests Scheduling Technique for Multicore Processors Walid El-Reedy Electronics and Comm. Engineering Cairo University, Cairo, Egypt walid.elreedy@gmail.com Ali A. El-Moursy Electrical

More information

Eliminating Microarchitectural Dependency from Architectural Vulnerability

Eliminating Microarchitectural Dependency from Architectural Vulnerability Eliminating Microarchitectural Dependency from Architectural Vulnerability Vilas Sridharan and David R. Kaeli Department of Electrical and Computer Engineering Northeastern University {vilas, kaeli}@ece.neu.edu

More information

Using Hardware Vulnerability Factors to Enhance AVF Analysis

Using Hardware Vulnerability Factors to Enhance AVF Analysis Using Hardware Vulnerability Factors to Enhance AVF Analysis Vilas Sridharan and David R. Kaeli ECE Department Northeastern University Boston, MA 02115 {vilas, kaeli}@ece.neu.edu ABSTRACT Fault tolerance

More information

Dynamic Capacity-Speed Tradeoffs in SMT Processor Caches

Dynamic Capacity-Speed Tradeoffs in SMT Processor Caches Dynamic Capacity-Speed Tradeoffs in SMT Processor Caches Sonia López 1, Steve Dropsho 2, David H. Albonesi 3, Oscar Garnica 1, and Juan Lanchares 1 1 Departamento de Arquitectura de Computadores y Automatica,

More information

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Aneesh Aggarwal Electrical and Computer Engineering Binghamton University Binghamton, NY 1392 aneesh@binghamton.edu Abstract With the

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1) 1 Problem 3 Consider the following LSQ and when operands are available. Estimate

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading CSE 502 Graduate Computer Architecture Lec 11 Simultaneous Multithreading Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson,

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance

More information

Boosting SMT Performance by Speculation Control

Boosting SMT Performance by Speculation Control Boosting SMT Performance by Speculation Control Kun Luo Manoj Franklin ECE Department University of Maryland College Park, MD 7, USA fkunluo, manojg@eng.umd.edu Shubhendu S. Mukherjee 33 South St, SHR3-/R

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 s Joshua J. Yi and David J. Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are

More information

Reducing Reorder Buffer Complexity Through Selective Operand Caching

Reducing Reorder Buffer Complexity Through Selective Operand Caching Appears in the Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003 Reducing Reorder Buffer Complexity Through Selective Operand Caching Gurhan Kucuk Dmitry Ponomarev

More information

Power-Efficient Approaches to Reliability. Abstract

Power-Efficient Approaches to Reliability. Abstract Power-Efficient Approaches to Reliability Niti Madan, Rajeev Balasubramonian UUCS-05-010 School of Computing University of Utah Salt Lake City, UT 84112 USA December 2, 2005 Abstract Radiation-induced

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

Simultaneous Multithreading Architecture

Simultaneous Multithreading Architecture Simultaneous Multithreading Architecture Virendra Singh Indian Institute of Science Bangalore Lecture-32 SE-273: Processor Design For most apps, most execution units lie idle For an 8-way superscalar.

More information

Cache Implications of Aggressively Pipelined High Performance Microprocessors

Cache Implications of Aggressively Pipelined High Performance Microprocessors Cache Implications of Aggressively Pipelined High Performance Microprocessors Timothy J. Dysart, Branden J. Moore, Lambert Schaelicke, Peter M. Kogge Department of Computer Science and Engineering University

More information

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Aalborg Universitet Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Publication date: 2006 Document Version Early version, also known as pre-print

More information

Exploiting Value Prediction for Fault Tolerance

Exploiting Value Prediction for Fault Tolerance Appears in Proceedings of the 3rd Workshop on Dependable Architectures, Lake Como, Italy. Nov. 28. Exploiting Value Prediction for Fault Tolerance Xuanhua Li and Donald Yeung Department of Electrical and

More information

Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC

More information

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University {parki,vijay}@ecn.purdue.edu http://min.ecn.purdue.edu/~parki http://www.ece.purdue.edu/~vijay Abstract

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt Department of Electrical and Computer Engineering The University

More information

Applications of Thread Prioritization in SMT Processors

Applications of Thread Prioritization in SMT Processors Applications of Thread Prioritization in SMT Processors Steven E. Raasch & Steven K. Reinhardt Electrical Engineering and Computer Science Department The University of Michigan 1301 Beal Avenue Ann Arbor,

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors

Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors R. Ubal, J. Sahuquillo, S. Petit and P. López Department of Computing Engineering (DISCA) Universidad Politécnica de Valencia,

More information

Reliable Architectures

Reliable Architectures 6.823, L24-1 Reliable Architectures Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 6.823, L24-2 Strike Changes State of a Single Bit 10 6.823, L24-3 Impact

More information

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Keerthi Bhushan Rajesh K Chaurasia Hewlett-Packard India Software Operations 29, Cunningham Road Bangalore 560 052 India +91-80-2251554

More information

SEVERAL studies have proposed methods to exploit more

SEVERAL studies have proposed methods to exploit more IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying

More information

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Dan Doucette School of Computing Science Simon Fraser University Email: ddoucett@cs.sfu.ca Alexandra Fedorova

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Appears in the Proceedings of the 30 th Annual International Symposium on Computer Architecture (ISCA) Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) #1 Lec # 2 Fall 2003 9-10-2003 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors In Proceedings of the th International Symposium on High Performance Computer Architecture (HPCA), Madrid, February A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

More information

Multithreaded Value Prediction

Multithreaded Value Prediction Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors : Optimizations for Improving Fetch Bandwidth of Future Itanium Processors Marsha Eng, Hong Wang, Perry Wang Alex Ramirez, Jim Fung, and John Shen Overview Applications of for Itanium Improving fetch bandwidth

More information

(big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing.

(big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing. (big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing. Intro: CMP with MT cores e.g. POWER5, Niagara 1 & 2, Nehalem Off-chip miss

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

Checker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India

Checker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India Advanced Department of Computer Science Indian Institute of Technology New Delhi, India Outline Introduction Advanced 1 Introduction 2 Checker Pipeline Checking Mechanism 3 Advanced Core Checker L1 Failure

More information