Characterizing Microarchitecture Soft Error Vulnerability Phase Behavior

Size: px
Start display at page:

Download "Characterizing Microarchitecture Soft Error Vulnerability Phase Behavior"

Transcription

1 Characterizing Microarchitecture Soft Error Vulnerability Phase Behavior in Fu James Poe Tao Li José A. B. Fortes Department of ECE University of Florida Department of ECE University of Florida Department of ECE University of Florida Department of ECE University of Florida Abstract Computer systems increasingly depend on exploiting program dynamic behavior to optimize performance, power and reliability. Prior studies have shown that program execution exhibits phase behavior in both performance and power domains. Reliabilityoriented program phase behavior, however, remains largely unexplored. As semiconductor transient faults (soft errors) emerge as a critical challenge to reliable system design, characterizing program phase behavior from a reliability perspective is crucial in order to apply dynamic fault-tolerant mechanisms and to optimize performance/reliability trade-offs. In this paper, we compute run-time program vulnerability to soft errors on four microarchitecture structures (i.e. instruction window, reorder buffer, function units and wakeup table) in a high-performance out-of-order execution superscalar processor. Experimental results on the SPEC2 benchmarks show a considerable amount of time varying behavior in reliability measurements. Our study shows that a single performance metric, such as IPC, cache miss or branch misprediction, is not a good indicator for program vulnerability. The vulnerabilities of the studied microarchitecture structures are then correlated with program code-structure and run-time events to identify vulnerability phase behavior. We observed that both program code-structure and run-time events appear promising in classifying program reliability phase behavior. Overall, performance counter based schemes achieved an average Coefficient of Variation (COV) of 3.5%,.5%,.3% and 5.7% on the instruction queue, reorder buffer, function units and the wakeup table, while basic block vectors offer COVs of.9%, 5.%, 5.% and 6% on the four studied microarchitecture structures respectively. We found that in general, tracking performance metrics performs better than tracking control flow in identifying reliability phase behavior of applications. To our knowledge, this paper is the first to characterize program reliability phase behavior at the microarchitecture level. 1. Introduction As computer architecture becomes more complex, its efficiency increasingly depends on the capability of tuning hardware and software resources at run-time for different program behavior. Previous research [1, 2, 3,, 5] shows that program execution characteristics, such as performance and power features, exhibit time varying behavior called phases. Various techniques [6, 7,, 9, 1, 11,, 13, 1] have been proposed to capture program phases within which workloads show homogeneous characteristics of interest. In the past, program phases have been exploited in both performance and power domains. As semiconductor transient faults (soft errors) emerge as a critical threat to reliable system design, understanding and characterizing program dynamic behavior from the perspective of dependability becomes important. The rapidly increasing soft error rates and the advent of billiontransistor chips suggest that it is infeasible to protect every transistor from soft error strikes, making workload dependent dynamic fault-tolerant techniques attractive. For example, when long latency instructions (e.g. loads that miss caches, floating point operations) in programs are encountered in the pipeline, a microprocessor can selectively squash instructions that sit in the instruction queue to reduce the likelihood of faults due to soft error strikes to that structure [15]. These techniques can be efficiently applied to optimize performance/reliability trade-offs only if the program related reliability phases and the critical execution regions required for fault tolerance can be identified. Prior work [2,, 6, 7,, 17] showed that program phases can be used to effectively evaluate and optimize performance and power. Similarly, if programs exhibit reliability-oriented phases and these phases can be accurately and efficiently identified, we can exploit workload characteristics in the design, optimization and evaluation of dependable computer architecture. However, to our knowledge, this issue has not been addressed by the community so far. This paper focuses on characterizing and classifying program reliability phase behavior on state-of-the-art processor microarchitecture. In earlier work, we developed a microarchitecture level software vulnerability analysis framework Sim-SODA [1]. Employing the Sim-SODA framework and the SPEC2 benchmarks, we compute run-time program vulnerability to soft errors on four microarchitecture structures (i.e., instruction queue, reorder buffer, functional units and wakeup table). Experimental results show that the microarchitecture components studied manifest significant time varying behavior in their run-time reliability characterization. To identify appropriate reliability estimators, we used various performance metrics and examined their correlations with vulnerability factors at the microarchitecture level. We found that in general, using a simple performance metric is insufficient to predict program vulnerability behavior. We then explored code structure based and run-time event based phase analysis techniques and compared their reliability phase prediction abilities. This work makes the following contributions: We quantitatively analyzed the run-time soft error vulnerability on four microarchitecture structures which play an important role in determining the reliability of a high performance microprocessor with dynamic scheduling and out-of-order execution. We found that all studied microarchitecture structures manifest significant variability in their run-time reliability characteristics. We observed a fuzzy correlation between hardware reliability and a single performance metric (such as IPC, branch prediction rate and cache miss rate) across the simulated workloads. This suggests that a simple

2 performance metric is not a good indicator for program reliability behavior. We investigated reliability-oriented phase classification using both code structure based and run-time event based schemes. We compared the effectiveness of different techniques and explored the cost-effective reliability phase classification methods. We further explored the use of hybrid information to guide reliability phase selection. Finally, we performed a sensitivity analysis to understand the applicability of different techniques on machines with different processor and hardware structure configurations. The rest of this paper is organized as follows. Section 2 describes a method used to compute program reliability and details the techniques used to model vulnerability of the studied microarchitecture structures. Section 3 presents our experimental setup, including the architectural level reliability simulator Sim- SODA, the simulated machine configuration and the studied benchmarks. Section profiles time varying behavior of microarchitecture vulnerabilities and their correlation with commonly used performance metrics. Section 5 explores control flow and run-time event based approaches and evaluates their abilities in classifying reliability-oriented phases. Section 6 summarizes the paper and outlines our future work. 2. Microarchitecture Level Soft Error Vulnerability Analysis At the microarchitecture level, program vulnerability to soft errors can be modeled using several methodologies. For example, Li and Adve [19] estimate reliability using a probabilistic model of the error generation and propagation process in a processor. In the past, statistic fault injection has also been used in several studies [2, 21, 22] to evaluate architectural reliability. In this work, we estimate the reliability of processor microarchitecture structures using the Architectural Vulnerability Factor (AVF) computing methods introduced in [23, 2]. This section provides a brief background of AVF and describes how we calculated the AVF of the studied microarchitecture structures. The AVF of a hardware structure is the probability that a transient fault in that hardware structure can cause erroneous output of a program. The overall hardware structure s error rate is determined by two factors: the device raw error rate, mainly determined by circuit design and processing technology, and the AVF. Since the hardware raw error rate normally does not vary with code execution, AVF can be used as a good reliability estimator to quantify how vulnerable the hardware is to soft errors at different program execution stages. To compute AVF, one needs to distinguish those bits that affect the final system output and those that do not. A subset of processor state bits required for architecturally correct execution (ACE) are called ACE bits [23]. The AVF of a hardware structure in a given cycle is the percentage of ACE bits that the structure holds, and the AVF of a hardware structure during program execution is the average AVF at any point in time. In practice, identifying un- ACE bits, the processor state bits that do not affect correct program execution, is much easier. Examples of locations that often contain un-ace bits include NOPs, idle or invalid states, uncommitted instructions and dynamically dead instructions and data. An instruction is considered dynamically dead if its result is not used by any other instructions and therefore will not affect the final output of the program. A cycle accurate execution driven simulator can be used to identify un-ace bits and to track the residency cycles of the un-ace bits in hardware structures. To compute the AVF of the entire processor, the AVF of all hardware components are added together after they are weighted with the number of bits in each structure. In this paper, we focus on characterizing the AVF phase behavior of an individual microarchitecture structure since we believe that a componentbased reliability analysis is more suitable for the design and optimization of reliability-aware architecture. For example, the hardware components that yield high AVFs can be identified and protected, either at the design stage or at run-time. In the following subsections, we describe the methodologies to compute AVFs of the four studied microarchitecture structures: instruction window, reorder buffer, function units and wakeup table. We chose these components due to 1) their significance in reliable program execution and 2) the heterogeneity of their AVF computation Instruction Window Modern high-performance processors use the instruction window to support dynamic scheduling and out-of-order execution. To effectively exploit instruction level parallelism, the instruction window buffers a significant amount of program runtime states, making it vulnerable to soft error strikes. For example, Mukherjee et al [23] reported that the instruction window AVF number ranges between 1% and 7% on the slices of SPEC 2 benchmarks. To model the AVF of the instruction window, we classify each bit in the instruction window as an ACE-bit or an un-ace bit. For example, all bits stored in idle or invalid entries are un-ace bits. All bits except opcodes in NOPs and prefetch instructions are also un-ace bits. In our study, we identified the following sources that contribute to the un-ace bits in the instruction window: uncommitted instructions, NOPs, prefetch instructions, dynamically dead instructions, trivial instructions, idle entries and states. To identify uncommitted instructions, NOPs, prefetch instructions and idle entries using a performance model is straightforward. However, tracking the dynamically dead instructions can be more computationally involved. We implement a post-graduation instruction analysis window as proposed in [23] to detect the future use of an instruction s result, and identify dynamically dead instructions. We set the analysis window to, instructions which can cover most of the future use of an instruction s result [1]. Instructions which have no consumer in the following, instructions are conservatively marked as unknown instructions. In [23], Mukherjee et al identified logical masking instructions (e.g. compare instructions prior to a branch, bitwise logical operations and 32-bit operations in a 6-bit architecture) as a source of un-ace bits. An operand and its bits are logically masked and can be attributed to un-ace bits if the operand does not influence the result of an instructions execution. In this study, we identified additional logical masking bits. We found that the bits used to encode the specifiers of source registers which hold logically masked values are un-ace bits. This is because while a corrupted register specifier may cause the processor to fetch the wrong data from a different register, the computation result will not be altered because of the logical masking effect. Additionally, we extend logical masking instructions to trivial instructions [25]. Trivial instructions are those computations whose output can be

3 determined without performing the computation, so they cover all the un-ace bits that logical masking instructions can identify. We classified the trivial instructions into three categories [1] and identified the corresponding un-ace bits. Note that trivial instructions can be dynamically dead instructions also. If an instruction is both a trivial instruction and a dynamically dead instruction, we attribute it to the dynamically dead instructions because more un-ace bits can be derived from that type of instruction Reorder Buffer In an out-of-order execution microprocessor, program execution states are buffered in the reorder buffer (ROB) until they are retired from the pipeline. To effectively exploit ILP, modern processors use a large ROB, implying its AVF can greatly affect the AVF of the entire chip. Bits in each ROB entry are allocated for an on-the-fly instruction. An ROB entry includes the instruction number, the destination register specifier, and the operand and control bits. Sources of un-ace bits that we identified for the ROB are the same as those for the instruction window. Similarly, we used a performance model augmented with an instruction analysis window and trivial instruction classification to identify these scenarios in the ROB [1] Function Unit In this study, we modeled integer function units used by an Alpha-26 processor unless otherwise specified [26]. Similar to [23], we assume each function unit has about 5% control latches and 5% data-path latches, and that the data-path within it has a width of 6 bits. We applied trivial instruction analysis to each function unit to further attribute un-ace bits to different instructions. The semantics of trivial instructions implies that one input value to the function unit can be un-ace. There are other instructions that only produce 32-bit outputs. In that case, the upper 32-bits in the output data path become idle. We observed that function unit ACE bits are mainly contributed by the operands of ACE instructions (instructions which affect program final output) as well as the output of those instructions [1]. Because the majority of instructions have two input operands and one output operand, the input bits contribute more ACE bits in the function units than the output bits do. 2.. Wakeup Table When an instruction completes its execution, its destination register ID is broadcasted to all of the instructions in the window to inform dependent instructions of the availability of the result. Each entry compares the broadcasted register ID with its own source register ID. If there is a match, the source operand is latched and the dependent instruction may be ready to execute. This register ID broadcast and associative comparison is called instruction wake-up. A soft error that results in an incorrect match between a broadcasted physical register ID and a corrupted tag may cause instructions waiting for that operand to be issued prematurely. A single bit error in the tag array that results in a mismatch where there should have been a hit can prevent ready instructions from being issued, causing a deadlock in issuing instruction. Therefore, the wake-up table is vulnerable to soft error strikes. We have developed the following method to compute the AVF of the wake-up table: when a new instruction is allocated in the instruction window, the wake-up table records the renamed physical register IDs of instructions on which this instruction depends. There are two fields in each wake-up table entry to hold the renamed register IDs for the two source operands of an instruction. A field in the wake-up table entry becomes invalid once the source operand is ready. The operations on the wake-up table include fill, read and invalidate ; therefore, its nonoverlapping lifetime can be partitioned into fill-to-read, read-toread and invalidate-to-fill periods. Note that there is no read-toinvalidate component because the last read between fill and invalidate will cause a match between the stored register ID and the broadcasted register ID. Once there is a match, the field in the wake-up table will become invalid immediately. In other words, the lifetime of the read-to-invalidate component in the wake-up table is always zero. Therefore, we combine fill-to-read and readto-read components together, and attribute the invalidate-to-fill component as un-ace. 3. Experimental Methodology This section describes our experimental framework, simulated machine architecture and benchmarks The Sim-SODA Reliability Estimation Framework The models described in Section 2 can be integrated into a range of architectural simulators to provide reliability estimates. We have developed Sim-SODA [1], a unified simulation framework that models software reliability on microprocessorbased systems. To implement these AVF models into a unified, timing accurate framework, we have instrumented the Sim-Alpha architectural simulator, which has been validated to the Alpha 26 processor [27, 2]. The Sim-SODA framework also includes microarchitecture level reliability models for cache, TLB, and load/store queue. Through cycle-level simulation, both microarchitecture and architecture states are classified into ACE/un-ACE bits and their residency and resource usage counts are generated. This information is then used to estimate the reliability of various hardware structures. To characterize AVF phase behavior, we broke a program s execution into continuous non-overlapping intervals. An interval is a continuous portion of instructions in a program. The Sim-SODA framework outputs a set of AVFs of the important structures and also a set of performance counters at the end of every simulation interval, and resets them to zero at the beginning of the next interval. We collected microarchitecture structures AVFs and program performance counters for each interval Simulated Machine Configuration We simulated an Alpha-26-like -way dynamically scheduled microprocessor with a two level cache hierarchy. The baseline microarchitecture model is detailed in Table Simulated Workloads For our reliability-oriented phase characterization experiments, we used 1 SPEC2K integer benchmarks and obtained AVF and performance characteristics using Sim-SODA. Obtaining accurate timing information is critical for AVF estimation. We didn t include SPEC2K FP benchmarks because Sim-Alpha does not model floating point pipeline execution accurately: it shows 21.5% error across the set of floating point benchmarks, compared with 6.6% error across the set of SPEC integer benchmarks [27]. To perform this study within a reasonable amount of time, we used the MinneSPEC [29] reduced

4 input datasets. However, bzip2 and gzip still require a significant amount of time to finish simulation. Therefore, we could not include their results in this paper draft. In order to collect enough interval information for accurate AVF phase behavior classification, we chose each benchmark s interval size based on the total dynamic instructions executed. For instance, and have 1, instructions in each interval, while has an interval size of 2, instructions. Table 1. Simulated Machine Configuration Parameter Configuration Pipeline depth 7 Integer ALUs/ multi / Integer ALU/multi latency 1/7 Fetch/Slot/Map/Issue/Commit ////11 instructions per cycle Issue queue size 2 Reorder buffer size Register file size Load/store queue size 32 Branch predictor Hybrid, K global + 2-level 1K local+ K choice Return address stack 32-entry Branch mispredict penalty 7 cycles L1 instruction cache 6KB instruction/6kb data,2-way, 6B line, 1-cycle latency L1 data cache 6KB instruction/6kb data,2-way, 6B line, 3-cycle latency L2 cache 2KB, direct mapped, 6B line, 7-cycle latency TLB size -entry ITLB/-entry DTLB fully-associative MSHR entries /cache Prefetch MSHR 2/cache Victim buffer -entry, 1 cycle hit latency. Characterizing Run-time Microarchitecture Vulnerability to Soft Errors This section profiles run-time microarchitecture structure reliability characteristics. In section.1, we illustrate and quantify the time varying behavior of microarchitecture vulnerability to soft errors. Section.2 correlates program reliability statistics with simple performance metrics..1. Time Varying Behavior of Microarchitecture Vulnerability Instruction Window AVF (%) Instruction Window AVF (%) % 2% % 6% % 1% Execution Time % 2% % 6% % 1% Execution Time Figure 1. Run-time Instruction Window AVF Profile on Benchmark and Figure 1 shows the run-time instruction window AVF profile on the benchmarks and. Each point of the plots represents the AVF of an interval with a size of 1, instructions. (We don t show all the 1 benchmarks AVF profiles because of space limitations). As can be seen, during program execution, the instruction window s vulnerability shows significant variation between intervals, and some intervals also show similar AVF behavior regardless of temporal adjacency. Table 2 lists AVF statistics in terms of mean and Coefficient of Variation (COV) for all studied structures and all simulated benchmarks. The COV has been widely used in evaluation of phase analysis techniques. It is the standard deviation divided by the mean. In other words, CoV measures standard deviation as a percentage of the mean. Table 2 shows that although both the instruction window and the ROB are used to support out-of-order execution, they show significant discrepancy in their AVFs: the ROB s AVF is lower than the instruction window s AVF on a majority of studied benchmarks. This can be explained as follows: The Alpha 26 processor has separate integer and floating point instruction windows [26]. The integer instruction window has 2 entries. The ROB is used to hold all types of instructions and the size of the ROB is entries. Due to the lack of floating point operations, the fraction of idle bits in the ROB is much higher than that in the instruction window. Table 2 also shows that the COVs of the runtime AVFs on,, and are much higher than those on the remaining benchmarks. This indicates that these programs contain complex vulnerability phases which may be challenging for accurate phase classification. Table 2. Variation in Run-time Microarchitecture Vulnerability Inst. Window Reorder Function Wakeup Table Benchmarks AVF Buffer AVF Unit AVF AVF MEAN COV MEAN COV MEAN COV MEAN COV 23 6% 23 2% 17 1% 11 2% 37 7% 17 7% 31 1% 9 17% 27 11% 1 15% 1 1% 5 % 2 96% 3% 5 1% 9 5% 32 % 13 13% 19 5% 6 2% 29 9% 7% % 13% 13% 15 1% 1 % 1 2% 26 2% 13 5% 17 23% 3 2% 9% 15 2% 19 1% 1% 29 25% 1 37% 15 17% 7 33% 31 29% 17 2% 17 31% 23%.2. How Correlated is AVF to Performance Metrics If a hardware structure s AVF shows strong correlations with a simple performance metric, its AVF can be predicted using that performance metric across different benchmarks. To determine how correlated vulnerability is to performance metrics, we calculated the correlation coefficients between the run-time hardware structure s AVF and each of the simple performance metrics using statistics that were gathered from each interval during program execution. We chose widely used performance metrics such as IPC, branch prediction rate, L1 data and instruction miss rates and L2 cache miss rate. Figure 2 shows there are fuzzy correlations between the AVFs and the different performance metrics. For example, for the instruction window, IPC shows strong correlation with AVF on the benchmarks,, and, while yielding near zero correlation coefficients on and. Intuitively, high IPC reduces the ACE bits residency time in a microarchitecture structure, resulting in a reduction of the AVF. On the other hand, the high ILP (usually manifested by high IPC) in a program can cause the microprocessor to aggressively bring more instructions into the pipeline, increasing the total number of ACE bits in a microarchitecture structure. Interestingly, we observed that the same performance metric (e.g. IPC) can exhibit both positive and negative correlations with the AVFs of different structures running on the same benchmark. For example, on, the correlation coefficient between IPC and the ROB s AVF is -.9, while the correlation coefficients between IPC and

5 other structure s AVFs are all around.99. As mentioned in Section 2, ACE bits residency time plays an important role in determining AVF. For the same instruction, its residency time in different structures can vary significantly. For example, when a processor executes a low ILP code segment, few instructions can graduate. This can increase the ROB s AVF if those instructions contain a significant amount of ACE bits. Note that the low IPC doesn t necessarily mean a long residency time of instructions in the instruction window, the functional units and the wakeup table. The instructions may have been completed a long time before their final graduation from the ROB because of the out-of-order execution and in-order commitment. Overall, the results shown in Figure 2 suggest that we simply cannot use one performance metric to indicate hardware reliability behavior. Corrleation Coefficient Corrleation Coefficient Corrleation Coefficient Corrleation Coefficient ipc bpred_rate dl1miss_rate il1miss_rate l2miss_rate (a) Instruction Window ipc bpred_rate dl1miss_rate il1miss_rate l2miss_rate 5. Program Reliability Phase Classification Various phase detection techniques have been proposed in the past. In [2, 7, 11, 3, 32], phases are classified by examining the program s control flow behavior via the features of basic block, working set, program counter, call graphs and control branch counters. This allows phase tracking to be independent of the underlying architecture configuration. In [9, 1], performance characteristics of the applications are used to determine phases. These methods detect phase changes without using knowledge of program code structure; however, the identified phases may not be consistent across different architectures. There are also previous studies on comparing and evaluating different phase detection techniques. In [], Dhodapkar and Smith showed that basic block vectors perform better than instruction working set and control branch counters in tracking performance oriented phases. The studies reported in [11, 13] revealed the correlations between application s phase behavior with code signatures. Isci (b) Reorder Buffer ipc bpred_rate dl1miss_rate il1miss_rate l2miss_rate (c) Function Unit ipc bpred_rate dl1miss_rate il1miss_rate l2miss_rate (d) Wakeup Table Figure 2. The Correlations between AVFs and a Simple Performance Metric and Martonosi recently conducted a study [1] where they compared different phase classification techniques in tracking run-time power related phases. Although the above methods have been used in performance and power characterization, their effectiveness in classifying program reliability-oriented phases remains unknown. In this section, we examine program vulnerability phase detection using two popular techniques: the basic block vector based and the performance event counter based schemes. A complete comparison of all of the phase analysis techniques exceeds the scope of this paper. In section 5.1, we compare the effectiveness of different techniques and explore the cost-effective reliability phase classification methods. In section 5.2, we explore the use of hybrid information to guide reliability phase selection. In section 5.3, we perform a sensitivity analysis to understand the applicability of different techniques on machines with different processor configurations. To our knowledge, there has been no prior effort in investigating program reliability phase detection Basic Block Vectors vs. Performance Monitoring Counters A basic block is a single-entry, single-exit section of code with no internal control flow. Therefore, sequences of basic blocks can be used to represent control flow behavior of applications. In [2], Sherwood and Calder proposed the use of Basic Block Vectors (BBV) as a metric to capture a program s phase behavior. They first determine all of the unique basic blocks in a program, create a BBV to represent the number of times each unique basic block is executed in an interval, and then weight each basic block so that instructions are represented equally regardless of the size of the basic block. In this paper, we quantify the effectiveness of basic block vectors in capturing AVF phase behavior across several different microarchitecture structures. In [9], Isci and Martonosi showed that hardware performance monitoring counters (PMC) can be exploited to efficiently capture program power behavior. To examine an approach that uses performance characteristics in AVF phase analysis, we collected a set of performance monitor counters (PMC) to build a PMC vector for each execution interval. We then used the PMC vectors as a metric to classify AVF phases. In our experiments, we instrumented the Sim-SODA framework with a set of 1 performance counters and dumped the statistics of these counters for each execution interval. We then ran an exhaustive search on the 1 counters to determine the 15 counters (shown as PMC-15 in Table 3) that were able to characterize the AVF phases most accurately. We chose the size of the counter vector to be 15 because BBV is commonly reduced to 15 dimensions after random projection [2]. To investigate whether the AVF phases are predictable using a small set of counters, we further reduced the PMC vector to 5 dimensions (shown as PMC-5 in Table 3). The 5 counters were chosen based on their importance in performance characterization and their generality and availability across different hardware platforms. The BBVs of each benchmark were generated using the SimPoint tool set [2] and the PMC vectors were dumped for every interval during Sim-SODA simulation. We then ran the k- means clustering algorithm [31] to compare the similarity of all the intervals and to group them into phases. Table lists the number of phases identified by the k-means clustering algorithm. In this study, we also examined the use of hybrid information that

6 combined both BBV and PMC vectors to characterize reliability phases. More detailed information about the hybrid schemes can be found in section 5.2. As shown in Table, the BBV generates a slightly larger number of phases than the PMC-15. The PMC-15 scheme detects more phases than the PMC-5 on a majority of the studied benchmarks. Table 3. Events used by PMC-15 and PMC-5 Schemes Events PMC-15 PMC-5 Number of Loads Number of Stores Number of Branches Total Instructions Total Cycles Correctly Predicted Branches Pipeline Flushes due to Branch Misprediction Pipeline Flush due to other Exceptions Victim Buffer Misses Integer Register File Reads Integer Register File Writes L1 Data Cache Misses L1 Instruction Cache Misses L2 Cache Misses Data TLB Misses Squashed Instructions Idle Entry in IQ Idle Entry in ROB Table. Number of Phases Identified by the K-means Clustering Algorithm Benchmarks BBV PMC-15 PMC-5 Hybrid (5+1) Hybrid (5+15) BBV perlbmlk We use the COV to compare different phase classification methods. After classifying a program s intervals into phases, we examined each phase and computed the average AVF of all of the intervals within the phase. We then calculated the standard deviation of the AVF for each phase, and we divided the standard deviation by the average to get the COV. We calculate an overall COV metric for all phases by taking the COV of each phase, weighting it by the percentage of execution that the phase accounts for, and then summing up the weighted COV. Better phase classification will exhibit lower COV. In the extreme case, if all of the intervals in the same phase have exactly the same AVF, then the COV will be zero. Figure 3 shows that both BBV and PMC methods can capture program reliability phase behavior on all of the studied hardware structures. As compared to the COV baseline case without phase classification, the BBV based techniques achieve significantly higher accuracies in identifying reliability phases. On the average, the BBV based scheme reduces the COVs of AVF by 6x, 5x, 6x, and x on the instruction queue, ROB, function unit and wakeup table respectively. Figure 3 further shows that the PMC-15 scheme can achieve even higher accuracies in characterizing program reliability behavior. For example, PMC- 15 yields lower COVs than BBV on out of the 1 studied benchmarks for all of the examined structures. On benchmarks and, BBV performs better than PMC-15 because of the weak correlations between performance metrics and vulnerability shown in Figure 2. Overall, the PMC-15 scheme leads to an average COV of 3.5%,.5%,.3% and 5.7% on the instruction queue, reorder buffer, function unit and wakeup table, while BBVs achieve COVs of.9%, 5.%, 5.% and 6% on the four studied microarchitecture structures respectively. As we expected, Figure 3 shows that the COV of AVF for PMC-5 is higher than that for PMC-15 in most cases. This is because PMC-15 provides additional information that can improve the accuracy in phase clustering. The only exception is shown in Figure 3 (a). From Figure 2 (a), one can see that on benchmark, several performance metrics (e.g, IPC, DL1miss_rate and L2miss_rate) already exhibit strong correlation with the instruction window s AVF. When more performance metrics are included, it is possible that additional noise is also introduced into the clustering. However, this situation does not happen frequently as we observed only one exception among all the cases that we analyzed. On the average, reducing the PMC dimensionality from 15 to 5 increases the COV by 1.3%,.%, 1.5% and 2.1% on the studied microarchitecture structures respectively. In practice, gathering information for 5 counters is much easier than collecting statistics for 15 counters. This suggests that the PMC-5 scheme can be used as a good approximation to the more expensive PMC-15 scheme. Compared with the BBV technique, the PMC-5 scheme yields better AVF prediction for the ROB but exhibits worse accuracy for the wakeup table. For the instruction window and function unit, both schemes show similar performance in AVF classification. In summary, our experiments show that both BBV and PMC phase analysis have significant benefit in characterizing reliability behavior. We found that in general, the PMC-15 phase analysis performs better than the BBV based approach The Effectiveness of Hybrid Schemes We also explored two hybrid phase detection approaches that combine information used by both BBV and PMC schemes. In our first experiment, we used random projection to obtain a BBV with 1 dimensions and then concatenated the BBV with a PMC- 5 vector of the same program execution interval to form a hybrid vector, i.e. Hybrid(1+5). In our second experiment, we appended the PMC-5 vector to the original BBV to form a 2 dimension hybrid vector, i.e. Hybrid (15+5). We then invoked the k-means clustering algorithm on the two hybrid vectors. We compared the effectiveness of hybrid schemes with the default BBV and a BBV with a 2 dimensions of data. Figure shows the COVs on Hybrid(1+5), BBV, Hybrid(15+5) and BBV-2 schemes. As can be seen, simply concatenating BBV and PMC information does not show significant benefit in reliability phase detections. In fact, Hybrid(1+5) show slightly worse performance than BBV on a majority of benchmarks. Similar observation is held when Hybrid(15+5) and BBV-2 are compared. One possible reason is that when the PMC vector and the BBV are randomly merged together, their information can interfere with each other. Such interference can reduce the quality of AVF phase classification. Although we only examine two combinations of PMC and BBV in this paper, we think it may be possible to gain some benefit from other combinations that can take the advantages of both schemes while avoiding the interference between each other. Further exploration of effective hybrid phase detection schemes is one aspect of our future work.

7 2 PMC-15 PMC-5 BBV (a) Instruction Window PMC-15 PMC-5 BBV (c)function Unit 5.3. Sensitivity Analysis Phase analysis techniques which use architecture dependent characteristics (e.g. PMC) may be sensitive to the machine configuration. To evaluate the robustness of the prior studied phase classification techniques in the context of AVF phase classification, we examine the applicability of using PMC, BBV, and hybrid techniques to detect program reliability phases across different architectures. As mentioned in section 3, our baseline model is a -way issue Alpha-26-like superscalar microprocessor. To generate different machine configurations, we varied the bandwidth in each pipeline stage, modified the latency, size and associativity of the L2 cache, and resized the branch predictor, load store queue and register files. We created two additional configurations to model -issue and -issue processors. The -issue machine has physical registers, a 2KB hybrid branch predictor, a 6-etnry load/store queue, and an MB -way set associative L2 cache. We also increased the branch misprediction penalty to 1 cycles on the -issue processor. We used a similar approach to further scale up resources for the -issue machine. Using these new configurations, we ran the Sim-SODA framework on all of the benchmarks. The collected data were then processed using the k- means clustering algorithm to obtain the phase classification results (shown in Figure 5). Figure 5 shows that the COVs of reliability phase classification yielded on the -issue architecture are very close to those on the -issue architecture regardless of phase characterization techniques. This indicates that the various phase classification schemes we examined in sections 5.1 and 5.2 show similar performance in reliability phase detection across different architecture configurations. Interestingly, we observed that the COV on the and -issue machines is higher than that on the - issue machine. Intuitively, as the processor issue width increases, programs show larger variation in different characteristics because of the reduced resource constraint. The increased standard deviation can result in an increment in COV. However, if the configuration of the hardware structures (i.e. instruction window, ROB, function unit and wakeup table) is fixed, the 2 PMC-15 PMC-5 BBV (b) Reorder Buffer PMC-15 PMC-5 BBV (d)wakeup Table Figure 3. COVs yielded by Different Phase Classification Schemes impact of the increased issue width will be limited by a threshold. The structures AVF will not change significantly when the issue width surpasses that threshold. This explains the noticeable increase in COV when the machine configuration is varied from the -issue to the -issue and the diminishing variation in COV when the issue width is further increased to. Table 5 shows AVF phase classification error on different machine configurations. Due to space limitations, we report the average statistics of the 1 studied benchmarks. Compared with COV, AVF errors are slightly changed across different configurations. This is because the positive and negative errors may cancel with each other, resulting in an overall reduction in that metric. The results shown in Table 5 indicate that various techniques are still able to track the aggregated AVF behavior on the more aggressive and -issue machines despite of the increased COV in phase classification. Table 5. AVF Phase Classification Error on Different Machine Configurations Instruction Window Reorder Buffer issue issue issue issue issue issue PMC-5.7% 1.59% 1.6%.71% 2.2% 2.2% PMC % 2.32% 1.% 1.5% 1.2% 1.19% BBV 3.21% 3.19% 3.9% 1.75% 2.3% 2.71% BBV-2 2.7% 1.57% 1.79% 1.2% 1.26% 2.7% Hybrid(1+5) 1.2% 1.66% 1.71%.9% 1.61% 2.1% Hybrid(15+5) 1.% 1.% 1.39% 1.95% 1.35% 1.6% Function Unit Wakeup Table issue issue issue issue issue issue PMC-5 1.% 1.3%.9% 1.69% 3.2% 2.59% PMC % 2.5% 1.% 3.72% 2.27% 2.% BBV 3.67% 3.63% 2.7% 1.91%.7% 2.2% BBV % 2.% 1.6% 1.5% 3.3% 3.51% Hybrid(1+5).3% 1.29%.9% 1.% 1.96% 1.91% Hybrid(15+5) 1.6%.% 1.3% 1.1% 2.19% 1.99% 6. Conclusions and Future Work Phase analysis is becoming increasingly important for optimizing the efficiency of next generation computer systems. As semiconductor transient faults become an increasing threat to

8 Hybrid(1+5) BBV Hybrid(15+5) BBV-2 (a) Instruction Window 2 Hybrid(1+5) BBV Hybrid(15+5) BBV-2 (b) Reorder Buffer 2 Hybrid(1+5) BBV Hybrid(15+5) BBV-2 (c) Function Unit 2 Hybrid(1+5) BBV Hybrid(15+5) BBV-2 (d) Wakeup Table Figure. COVs yielded by Hybrid Schemes Instruction Window issue issue CoV(%) PMC-5 PMC-15 BBV BBV-2 Hybrid(1+5) Hybrid(15+5) Reorder Buffer issue issue CoV(%) PMC-5 PMC-15 BBV BBV-2 Hybrid(1+5) Hybrid(15+5) Function Unit issue issue CoV(%) PMC-5 PMC-15 BBV BBV-2 Hybrid(1+5) Hybrid(15+5) 2 15 Wakeup Table issue issue CoV (%) 1 5 PMC-5 PMC-15 BBV BBV-2 Hybrid(1+5) Hybrid(15+5) Figure 5. The COVs of Reliability Phase Classification on the -issue and -issue Processors

9 reliable software execution, it is imperative to design workload dependent fault tolerant mechanisms for future microprocessors which contain billions of transistors. Observing reliabilityoriented program phase behavior is the first step leading to dynamic fault tolerant systems which can be tuned to meet the workload reliability requirement. In this work, we studied the run-time reliability characteristics of four important microarchitecture structures. We investigated the applicability and effectiveness of different phase identification methods in classifying program reliability-oriented phases. We found that both code-structure based and run-time event based phase analysis techniques perform well in detecting reliability oriented phases. In general, the performance counter scheme outperforms the control flow based scheme on a majority of benchmarks. We also found that program reliability phases can be cost-effectively identified using five performance counters that are generally available across different hardware platforms. This suggests that on-line reliability phase detection and prediction hardware with low complexity can be built on a small set of performance counters. Although we used AVF analysis as the method to quantify microarchitecture reliability, we believe the observations we made in this paper are independent of the reliability characterization method used. Our future work includes phase characterization using statistical fault injection. Alternative clustering methods (e.g. regression tree) will also be used to analyze reliability phases. Additionally, we will explore on-line reliability phase prediction techniques that enable dynamic fault tolerant system design. For example, if structure s AVF is higher than the threshold, a dynamic fault tolerant technique will be triggered to avoid signification performance degradation caused by soft error. Extensions to the Sim-SODA simulator infrastructure and the creation of additional modules are also topics of future research. Acknowledgment This research is partially supported by NASA award no. NCC , and by Microsoft Research Trustworthy Computing award no James Poe is supported by a National Science Foundation Graduate Research Fellowship. References [1] T. Sherwood, E. Perelman and B. Calder, Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications, In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 21. [2] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, Automatically Characterizing Large Scale Program Behavior, In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 22. [3] E. Duesterwald, C. Cascaval and S. Dwarkadas, Characterizing and Predicting Program Behavior and Its Variability, In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 23. []. Shen, Y. Zhong and C. Ding, Locality Phase Prediction, In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2. [5] C. Isci and M. Martonosi, Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data, In Proceedings of the International Symposium on Microarchitecture, 23. [6] T. Sherwood, S. Sair and B. Calder, Phase Tracking and Prediction, In Proceedings of the International Symposium on Computer Architecture, 23. [7] A. Dhodapkar and J. Smith, Managing Multi-Configurable Hardware via Dynamic Working Set Analysis, In Proceedings of the International Symposium on Computer Architecture, 22. [] A. Dhodapkar and J. Smith, Comparing Program Phase Detection Techniques, In Proceedings of the International Symposium on Microarchitecture, 23. [9] C. Isci and M. Martonosi, Identifying Program Power Phase Behavior using Power Vectors, In Proceedings of the International Workshop on Workload Characterization, 23. [1] C. Isci and M. Martonosi, Phase Characterization for Power: Evaluating Control-Flow-Based Event-Counter-Based Techniques, In Proceedings of the International Symposium on High-Performance Computer Architecture, 26. [11] M. Annavaram, R. Rakvic, M. Polito, J.-Y. Bouguet, R. Hankins and B. Davies, The Fuzzy Correlation between Code and Performance Predictability, In Proceedings of the International Symposium on Microarchitecture, 2. [] J. Lau, S. Schoenmackers and B. Calder, Structures for Phase Classification, In Proceedings of International Symposium on Performance Analysis of Systems and Software, 2. [13] J. Lau, J. Sampson, E. Perelman, G. Hamerly and B. Calder, The Strong Correlation between Code Signatures and Performance, In Proceedings of the International Symposium on Performance Analysis of Systems and Software, 25. [1] J. Lau, S. Schoenmackers and B. Calder, Transition Phase Classification and Prediction, In Proceedings of the International Symposium on High Performance Computer Architecture, 25. [15] C. Weaver, J. Emer, S. S. Mukherjee and S. K. Reinhardt, Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor, In Proceedings of the International Symposium on Computer Architecture, 2. [] M. Huang, J. Renau and J. Torrellas, Positional Adaptation of Processors: Application to Energy Reduction, In Proceedings of the International Symposium on Computer Architecture, 23. [17] A. Iyer and D. Marculescu, Power Aware Microarchitecture Resource Scaling, In Proceedings of the Design Automation and Test in Europe Conference, 21. [1]. Fu, T. Li and J. Fortes, Sim-SODA: A Framework for Microarchitecture Reliability Analysis, In Proceedings of the Workshop on Modeling, Benchmarking and Simulation (Held in conjunction with International Symposium on Computer Architecture), 26. [19]. D. Li, S. V. Adve, P. Bose, J. A. Rivers: SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors, In Proceedings of the International Conference on Dependable Systems and Networks, 25. [2] S. Kim and A. K. Somani, Soft Error Sensitivity Characterization of Microprocessor Dependability Enhancement Strategy, In Proceedings of the International Conference on Dependable System and Networks, 22. [21] N. J. Wang, J. Quek, T. M. Rafacz and S. J. Patel, Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline, In Proceedings of the International Conference on Dependable Systems and Networks, 2. [22] E. W. Czeck and D. Siewiorek, Effects of Transient Gate-level Faults on Program Behavior, In Proceedings of the International Symposium on Fault- Tolerant Computing, 199. [23] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor, In Proceedings of the International Symposium on Microarchitecture, 23. [2] A. Biswas, R. Cheveresan, J. Emer, S. S. Mukherjee, P. B. Racunas and R. Rangan, Computing Architectural Vulnerability Factors for Address-Based Structures, In Proceedings of the International Symposium on Computer Architecture, 25. [25] J. J. Yi and D. J. Lilja, Improving Processor Performance by Simplifying and Bypassing Trivial Computations, In Proceedings of the IEEE International Conference on Computer Design, 22. [26] R. E. Kessler, E. J. McLellan and D. A. Webb, The Alpha 26 Microprocessor Architecture, In Proceedings of the International Conference on Computer Design, 199. [27] R. Desikan, D. Burger, S. W. Keckler and T. Austin, Sim-alpha: A Validated, Execution-Driven Alpha 26 Simulator, Technical Report, TR-1-23, Dept. of Computer Sciences, University of Texas at Austin, 21. [2] R. Desikan, D. Burger and S. W. Keckler, Measuring Experimental Error in Microprocessor Simulation, In Proceedings of the International Symposium on Computer Architecture, 21. [29] A. J. KleinOsowski and D. J. Lilja, MinneSPEC: A New SPEC Benchmark Workload for Simulation Based Computer Architecture Research, Computer Architecture Letters, Vol. 1, June 22. [3] W. Liu and M. Huang, EPERT: Expedited Simulation Exploiting Program Behavior Repetition, In Proceedings of International Conference on Supercomputing, 2. [31] J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, [32] R. Balasubramonian, D. H. Albonesi, A. Buyuktosunoglu and S. Dwarkadas, Memory Hierarchy Reconfiguration for Energy and Performance in General Purpose Architectures, In Proceedings of the Internationsl Symposium on Micrarchitecture, 2.

Sim-SODA: A Unified Framework for Architectural Level Software Reliability Analysis

Sim-SODA: A Unified Framework for Architectural Level Software Reliability Analysis Sim-SODA: A Unified Framework for Architectural Level Software Reliability Analysis Xin Fu Department of ECE University of Florida xinfu@ufl.edu Tao Li Department of ECE University of Florida taoli@ece.ufl.edu

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Reliability in the Shadow of Long-Stall Instructions

Reliability in the Shadow of Long-Stall Instructions Reliability in the Shadow of Long-Stall Instructions Vilas Sridharan David Kaeli ECE Department Northeastern University Boston, MA 2115 {vilas, kaeli}@ece.neu.edu Arijit Biswas FACT Group Intel Corporation

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

Area-Efficient Error Protection for Caches

Area-Efficient Error Protection for Caches Area-Efficient Error Protection for Caches Soontae Kim Department of Computer Science and Engineering University of South Florida, FL 33620 sookim@cse.usf.edu Abstract Due to increasing concern about various

More information

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof.

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Transient Fault Detection and Reducing Transient Error Rate Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Steven Swanson Outline Motivation What are transient faults? Hardware Fault Detection

More information

ECE404 Term Project Sentinel Thread

ECE404 Term Project Sentinel Thread ECE404 Term Project Sentinel Thread Alok Garg Department of Electrical and Computer Engineering, University of Rochester 1 Introduction Performance degrading events like branch mispredictions and cache

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures

An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures Wangyuan Zhang, Xin Fu, Tao Li and José Fortes Department of Electrical and Computer Engineering,

More information

WITH the continuous decrease of CMOS feature size and

WITH the continuous decrease of CMOS feature size and IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 5, MAY 2012 777 IVF: Characterizing the Vulnerability of Microprocessor Structures to Intermittent Faults Songjun Pan, Student

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Reliable Architectures

Reliable Architectures 6.823, L24-1 Reliable Architectures Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 6.823, L24-2 Strike Changes State of a Single Bit 10 6.823, L24-3 Impact

More information

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 s Joshua J. Yi and David J. Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Improving Data Cache Performance via Address Correlation: An Upper Bound Study

Improving Data Cache Performance via Address Correlation: An Upper Bound Study Improving Data Cache Performance via Address Correlation: An Upper Bound Study Peng-fei Chuang 1, Resit Sendag 2, and David J. Lilja 1 1 Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of

More information

Eliminating Microarchitectural Dependency from Architectural Vulnerability

Eliminating Microarchitectural Dependency from Architectural Vulnerability Eliminating Microarchitectural Dependency from Architectural Vulnerability Vilas Sridharan and David R. Kaeli Department of Electrical and Computer Engineering Northeastern University {vilas, kaeli}@ece.neu.edu

More information

Reducing Reorder Buffer Complexity Through Selective Operand Caching

Reducing Reorder Buffer Complexity Through Selective Operand Caching Appears in the Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003 Reducing Reorder Buffer Complexity Through Selective Operand Caching Gurhan Kucuk Dmitry Ponomarev

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Analyzing and Improving Clustering Based Sampling for Microprocessor Simulation

Analyzing and Improving Clustering Based Sampling for Microprocessor Simulation Analyzing and Improving Clustering Based Sampling for Microprocessor Simulation Yue Luo, Ajay Joshi, Aashish Phansalkar, Lizy John, and Joydeep Ghosh Department of Electrical and Computer Engineering University

More information

Checker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India

Checker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India Advanced Department of Computer Science Indian Institute of Technology New Delhi, India Outline Introduction Advanced 1 Introduction 2 Checker Pipeline Checking Mechanism 3 Advanced Core Checker L1 Failure

More information

ABSTRACT. Reducing the Soft Error Rates of a High-Performance Microprocessor Using Front-End Throttling

ABSTRACT. Reducing the Soft Error Rates of a High-Performance Microprocessor Using Front-End Throttling ABSTRACT Title of Thesis: Reducing the Soft Error Rates of a High-Performance Microprocessor Using Front-End Throttling Smitha M Kalappurakkal, Master of Science, 2006 Thesis directed by: Professor Manoj

More information

Efficient Program Power Behavior Characterization

Efficient Program Power Behavior Characterization Efficient Program Power Behavior Characterization Chunling Hu Daniel A. Jiménez Ulrich Kremer Department of Computer Science {chunling, djimenez, uli}@cs.rutgers.edu Rutgers University, Piscataway, NJ

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

SimPoint: Picking Representative Samples to Guide Simulation. Brad Calder Timothy Sherwood Greg Hamerly Erez Perelman

SimPoint: Picking Representative Samples to Guide Simulation. Brad Calder Timothy Sherwood Greg Hamerly Erez Perelman SimPoint: Picking Representative Samples to Guide Simulation Brad Calder Timothy Sherwood Greg Hamerly Erez Perelman 2 0.1 Introduction Understanding the cycle level behavior of a processor during the

More information

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors : Optimizations for Improving Fetch Bandwidth of Future Itanium Processors Marsha Eng, Hong Wang, Perry Wang Alex Ramirez, Jim Fung, and John Shen Overview Applications of for Itanium Improving fetch bandwidth

More information

Low-Complexity Reorder Buffer Architecture*

Low-Complexity Reorder Buffer Architecture* Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

CS 838 Chip Multiprocessor Prefetching

CS 838 Chip Multiprocessor Prefetching CS 838 Chip Multiprocessor Prefetching Kyle Nesbit and Nick Lindberg Department of Electrical and Computer Engineering University of Wisconsin Madison 1. Introduction Over the past two decades, advances

More information

A Study for Branch Predictors to Alleviate the Aliasing Problem

A Study for Branch Predictors to Alleviate the Aliasing Problem A Study for Branch Predictors to Alleviate the Aliasing Problem Tieling Xie, Robert Evans, and Yul Chu Electrical and Computer Engineering Department Mississippi State University chu@ece.msstate.edu Abstract

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor

Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor Abstract Transient faults due to neutron and alpha particle strikes pose a significant obstacle to increasing processor transistor

More information

Outline. Emerging Trends. An Integrated Hardware/Software Approach to On-Line Power- Performance Optimization. Conventional Processor Design

Outline. Emerging Trends. An Integrated Hardware/Software Approach to On-Line Power- Performance Optimization. Conventional Processor Design Outline An Integrated Hardware/Software Approach to On-Line Power- Performance Optimization Sandhya Dwarkadas University of Rochester Framework: Dynamically Tunable Clustered Multithreaded Architecture

More information

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Program Phase Directed Dynamic Cache Way Reconfiguration for Power Efficiency

Program Phase Directed Dynamic Cache Way Reconfiguration for Power Efficiency Program Phase Directed Dynamic Cache Reconfiguration for Power Efficiency Subhasis Banerjee Diagnostics Engineering Group Sun Microsystems Bangalore, INDIA E-mail: subhasis.banerjee@sun.com Surendra G

More information

Locality-Based Information Redundancy for Processor Reliability

Locality-Based Information Redundancy for Processor Reliability Locality-Based Information Redundancy for Processor Reliability Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov, zhou}@cs.ucf.edu

More information

Exploiting Type Information in Load-Value Predictors

Exploiting Type Information in Load-Value Predictors Exploiting Type Information in Load-Value Predictors Nana B. Sam and Min Burtscher Computer Systems Laboratory Cornell University Ithaca, NY 14853 {besema, burtscher}@csl.cornell.edu ABSTRACT To alleviate

More information

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES Shashikiran H. Tadas & Chaitali Chakrabarti Department of Electrical Engineering Arizona State University Tempe, AZ, 85287. tadas@asu.edu, chaitali@asu.edu

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not

More information

One-Level Cache Memory Design for Scalable SMT Architectures

One-Level Cache Memory Design for Scalable SMT Architectures One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract

More information

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

PowerPC 620 Case Study

PowerPC 620 Case Study Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance

More information

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections ) Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections 2.3-2.6) 1 Correlating Predictors Basic branch prediction: maintain a 2-bit saturating counter for each

More information

Data-flow prescheduling for large instruction windows in out-of-order processors. Pierre Michaud, André Seznec IRISA / INRIA January 2001

Data-flow prescheduling for large instruction windows in out-of-order processors. Pierre Michaud, André Seznec IRISA / INRIA January 2001 Data-flow prescheduling for large instruction windows in out-of-order processors Pierre Michaud, André Seznec IRISA / INRIA January 2001 2 Introduction Context: dynamic instruction scheduling in out-oforder

More information

Architectures for Instruction-Level Parallelism

Architectures for Instruction-Level Parallelism Low Power VLSI System Design Lecture : Low Power Microprocessor Design Prof. R. Iris Bahar October 0, 07 The HW/SW Interface Seminar Series Jointly sponsored by Engineering and Computer Science Hardware-Software

More information

Detecting Phases in Parallel Applications on Shared Memory Architectures

Detecting Phases in Parallel Applications on Shared Memory Architectures Detecting Phases in Parallel Applications on Shared Memory Architectures Erez Perelman Marzia Polito Jean-Yves Bouguet John Sampson Brad Calder Carole Dulong Department of Computer Science and Engineering,

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections ) Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections 2.3-2.6) 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , ) Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14) 1 1-Bit Prediction For each branch, keep track of what happened last time and use

More information

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Appendix C Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Dynamic Capacity-Speed Tradeoffs in SMT Processor Caches

Dynamic Capacity-Speed Tradeoffs in SMT Processor Caches Dynamic Capacity-Speed Tradeoffs in SMT Processor Caches Sonia López 1, Steve Dropsho 2, David H. Albonesi 3, Oscar Garnica 1, and Juan Lanchares 1 1 Departamento de Arquitectura de Computadores y Automatica,

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

Defining Wakeup Width for Efficient Dynamic Scheduling

Defining Wakeup Width for Efficient Dynamic Scheduling Defining Wakeup Width for Efficient Dynamic Scheduling Aneesh Aggarwal ECE Depment Binghamton University Binghamton, NY 9 aneesh@binghamton.edu Manoj Franklin ECE Depment and UMIACS University of Maryland

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

On Pipelining Dynamic Instruction Scheduling Logic

On Pipelining Dynamic Instruction Scheduling Logic On Pipelining Dynamic Instruction Scheduling Logic Jared Stark y Mary D. Brown z Yale N. Patt z Microprocessor Research Labs y Intel Corporation jared.w.stark@intel.com Dept. of Electrical and Computer

More information

Computing Architectural Vulnerability Factors for Address-Based Structures

Computing Architectural Vulnerability Factors for Address-Based Structures Computing Architectural Vulnerability Factors for Address-Based Structures Arijit Biswas 1, Paul Racunas 1, Razvan Cheveresan 2, Joel Emer 3, Shubhendu S. Mukherjee 1 and Ram Rangan 4 1 FACT Group, Intel

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

A Reconfigurable Cache Design for Embedded Dynamic Data Cache

A Reconfigurable Cache Design for Embedded Dynamic Data Cache I J C T A, 9(17) 2016, pp. 8509-8517 International Science Press A Reconfigurable Cache Design for Embedded Dynamic Data Cache Shameedha Begum, T. Vidya, Amit D. Joshi and N. Ramasubramanian ABSTRACT Applications

More information

ARCHITECTURE DESIGN FOR SOFT ERRORS

ARCHITECTURE DESIGN FOR SOFT ERRORS ARCHITECTURE DESIGN FOR SOFT ERRORS Shubu Mukherjee ^ШВпШшр"* AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO T^"ТГПШГ SAN FRANCISCO SINGAPORE SYDNEY TOKYO ^ P f ^ ^ ELSEVIER Morgan

More information

Dynamic Scheduling. CSE471 Susan Eggers 1

Dynamic Scheduling. CSE471 Susan Eggers 1 Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Performance-Aware Speculation Control Using Wrong Path Usefulness Prediction. Chang Joo Lee Hyesoon Kim Onur Mutlu Yale N. Patt

Performance-Aware Speculation Control Using Wrong Path Usefulness Prediction. Chang Joo Lee Hyesoon Kim Onur Mutlu Yale N. Patt Performance-Aware Speculation Control Using Wrong Path Usefulness Prediction Chang Joo Lee Hyesoon Kim Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering

More information

Limiting the Number of Dirty Cache Lines

Limiting the Number of Dirty Cache Lines Limiting the Number of Dirty Cache Lines Pepijn de Langen and Ben Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology

More information

Limitations of Scalar Pipelines

Limitations of Scalar Pipelines Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline

More information

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai peir@cise.ufl.edu Computer & Information Science and Engineering

More information

Using a Victim Buffer in an Application-Specific Memory Hierarchy

Using a Victim Buffer in an Application-Specific Memory Hierarchy Using a Victim Buffer in an Application-Specific Memory Hierarchy Chuanjun Zhang Depment of lectrical ngineering University of California, Riverside czhang@ee.ucr.edu Frank Vahid Depment of Computer Science

More information

Cache Implications of Aggressively Pipelined High Performance Microprocessors

Cache Implications of Aggressively Pipelined High Performance Microprocessors Cache Implications of Aggressively Pipelined High Performance Microprocessors Timothy J. Dysart, Branden J. Moore, Lambert Schaelicke, Peter M. Kogge Department of Computer Science and Engineering University

More information

Improving Value Prediction by Exploiting Both Operand and Output Value Locality. Abstract

Improving Value Prediction by Exploiting Both Operand and Output Value Locality. Abstract Improving Value Prediction by Exploiting Both Operand and Output Value Locality Youngsoo Choi 1, Joshua J. Yi 2, Jian Huang 3, David J. Lilja 2 1 - Department of Computer Science and Engineering 2 - Department

More information

SEVERAL studies have proposed methods to exploit more

SEVERAL studies have proposed methods to exploit more IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Using Hardware Vulnerability Factors to Enhance AVF Analysis

Using Hardware Vulnerability Factors to Enhance AVF Analysis Using Hardware Vulnerability Factors to Enhance AVF Analysis Vilas Sridharan and David R. Kaeli ECE Department Northeastern University Boston, MA 02115 {vilas, kaeli}@ece.neu.edu ABSTRACT Fault tolerance

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism Software View of Computer Architecture COMP2 Godfrey van der Linden 200-0-0 Introduction Definition of Instruction Level Parallelism(ILP) Pipelining Hazards & Solutions Dynamic

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation

Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation Houman Homayoun + houman@houman-homayoun.com ABSTRACT We study lazy instructions. We define lazy instructions as those spending

More information

Two-Level Address Storage and Address Prediction

Two-Level Address Storage and Address Prediction Two-Level Address Storage and Address Prediction Enric Morancho, José María Llabería and Àngel Olivé Computer Architecture Department - Universitat Politècnica de Catalunya (Spain) 1 Abstract. : The amount

More information

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University

More information

Energy-efficient Phase-based Cache Tuning for Multimedia Applications in Embedded Systems

Energy-efficient Phase-based Cache Tuning for Multimedia Applications in Embedded Systems Energy-efficient Phase-based Cache Tuning for Multimedia Applications in Embedded Systems Tosiron Adegbija and Ann Gordon-Ross* Department of Electrical and Computer Engineering University of Florida,

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information

Use-Based Register Caching with Decoupled Indexing

Use-Based Register Caching with Decoupled Indexing Use-Based Register Caching with Decoupled Indexing J. Adam Butts and Guri Sohi University of Wisconsin Madison {butts,sohi}@cs.wisc.edu ISCA-31 München, Germany June 23, 2004 Motivation Need large register

More information

Versatile Prediction and Fast Estimation of Architectural Vulnerability Factor from Processor Performance Metrics

Versatile Prediction and Fast Estimation of Architectural Vulnerability Factor from Processor Performance Metrics Versatile Prediction and Fast Estimation of Architectural Vulnerability Factor from Processor Performance Metrics Lide Duan, Bin Li and Lu Peng Louisiana State University, Baton Rouge, LA 70803 {lduan1,

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

Accurate & Efficient Regression Modeling for Microarchitectural Performance & Power Prediction

Accurate & Efficient Regression Modeling for Microarchitectural Performance & Power Prediction Accurate & Efficient Regression Modeling for Microarchitectural Performance & Power Prediction {bclee,dbrooks}@eecs.harvard.edu Division of Engineering and Applied Sciences Harvard University 24 October

More information

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors Portland State University ECE 587/687 The Microarchitecture of Superscalar Processors Copyright by Alaa Alameldeen and Haitham Akkary 2011 Program Representation An application is written as a program,

More information