A Framework for Analyzing Architecture-Level Fault Tolerance Behavior in Applications


A Framework for Analyzing Architecture-Level Fault Tolerance Behavior in Applications

by Harshad Sane

B.S., University of Pune, India, 2004

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science, Department of Electrical and Computer Engineering, 2008

This thesis entitled: A Framework for Analyzing Architecture-Level Fault Tolerance Behavior in Applications, written by Harshad Sane, has been approved for the Department of Electrical and Computer Engineering.

Professor Daniel A. Connors
Professor Manish Vachharajani
Professor Li Shang

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Sane, Harshad (M.S., Computer Engineering)

A Framework for Analyzing Architecture-Level Fault Tolerance Behavior in Applications

Thesis directed by Professor Daniel A. Connors, Ph.D.

Radiation-induced transient faults, also known as single-event upsets, are a major concern in the design and operation of modern computer systems. Transient errors first impact the circuit and logic levels of a processor, and may propagate to the microarchitecture and architecture state of a processor. When undetected, transient errors in architecture state can lead to incorrect and undefined application behavior. Detailed simulation is a vital component of the design process of modern processors and of the exploration of new design concepts. However, high-level architectural simulators typically run orders of magnitude slower than native execution, making detailed simulation of most programs prohibitively long. Due to these deficiencies in simulation technology, current architecture-level transient fault studies are primarily based on sampling techniques. Current program fault behavior analysis uses architecture-level injection of random bits selected over a time line. Such injection methods allow only a limited number of injections per unit time, a high percentage of which may not expose the true fault susceptibility of a program. This thesis introduces an accurate and fast fault-injection framework for studying inherent code execution properties of a program that correlate with different levels of fault susceptibility. The framework utilizes a three-step approach, consisting of trace logging, fault injection, and replay execution, to emulate the effect of a transient fault in the architectural registers of a running application. Finally, by correlating the injection analysis results to dependence graph patterns across sets of applications, an analysis methodology is constructed to accurately predict the fault tolerance of an application without performing any error injections.

Dedication

I dedicate this thesis to my family for their unconditional support.

Acknowledgements

Firstly, I would like to thank my adviser, Professor Dan Connors, who guided me through my academic endeavor at CU. I would like to thank him for all the knowledge and encouragement he bestowed upon me. This work has been possible because of his guidance and prompt feedback. I would like to thank all the members of the DRACO research group for their insight and ideas. I would like to thank my friends, who made sure that I enjoyed life along with my career. Most importantly, I would like to thank my family, who have supported me unconditionally throughout my life.

Contents

Chapter 1 Introduction
  1.1 Contributions
Chapter 2 Background
  2.1 Single Event Phenomenon
  2.2 Architecture-level Fault Analysis
Chapter 3 Motivation
  3.1 Overview of current injection methodologies
  3.2 Fault penetration and point of injection
  3.3 Natural fault resilience and ineffective injections
Chapter 4 Fault Emulation Framework and Results
  4.1 Overview and Goals
  4.2 The TEFIS Framework
    4.2.1 Execution Tracing
    4.2.2 Fault Generation
    4.2.3 Fault Emulation
  4.3 Trace Emulation Experimental Results and Analysis
    4.3.1 Fault tolerance categories
    4.3.2 Accuracy of emulations
    4.3.3 Average Execution Time of Fault Emulations Against Full Injections
Chapter 5 Analysis of Fault Tolerance Program Behavior
  Source Code Analysis
  Algorithm Level Effects
  Dynamic Source Code Behavior
  Dynamic Program Trace Behavior
  Inter-Procedural Fault Tolerance
  Dynamic Dependence Graph Representation
  Estimating Program Fault Tolerance
Chapter 6 Future Work
Chapter 7 Conclusion
Bibliography

Tables

3.1 References
3.2 Fault tolerance characteristics of logical operations

Figures

2.1 A neutron strike
2.2 Feature size vs. soft error rate
3.1 Current random injection methodologies
3.2 Fault penetration
3.3 Fault injection distribution in time
3.4 A highly fault-tolerant case from 186.crafty with 80.6% correct results of emulation
4.1 Fractional emulation captures the effect of fractional execution with traces against full injections in the entire application; the figure shows the exclusion of build-up and monitoring time
4.2 Context dump
4.3 Framework overview
4.4 Fault emulation
4.5 Result categories
4.6 Emulation accuracy
4.7 A comparison of execution time
5.1 Fault tolerance analysis of sorting routines: (a) heapsort, (b) quicksort
5.2 A trace from 164.gzip with 91.4% incorrect results of emulation
5.3 A trace from 300.twolf with 100% segmentation fault results of emulation
5.4 Fault tolerance correlation with program counter similarity
5.5 Fault susceptibility correlation calculated using similar program code points
5.6 Inter-procedural fault tolerance
5.7 Example: Dependency graph
5.8 Dependence graph similarity correlation
5.9 Dependence graph similarity correlation - All benchmarks
5.10 Fault tolerance prediction

Chapter 1

Introduction

Scaling trends in technology that lead to faster and smaller transistors, lower voltage levels, and reduced noise margins also increase the susceptibility of a circuit to transient faults. Shielding hardware systems from radiation, cosmic rays, and crosstalk is difficult from a high-speed design perspective and is costly in terms of active power consumption. Transient errors first impact the circuit and logic levels of a processor, and can propagate to the micro-architecture and architecture state. Architecture state errors lead to invalid and unpredictable software behavior. Because many program phases are tolerant of some single-bit architecture errors, software shielding becomes an attractive solution due to its low cost and flexibility. Many software-based fault tolerance techniques [24, 23, 10] have been proposed to balance performance with error detection and recovery. Current studies involving the observation of the effects of transient faults utilize a limited number of random fault injections in which micro-architecture (pipeline registers, intermediate logic) or architecture (register, memory) state is modified during simulated execution. As these error injection campaigns involve emulating machine execution, each injection run includes substantial experimental time leading up to the point of injection, plus the remaining execution of the application to determine program correctness. Collectively, the excessive time for simulated injections limits the points of

the program execution that can be studied, thus reducing the significance of understanding the fault tolerance behavior in applications. There are a number of ways to improve the accuracy and execution time of fault analysis over traditional fault injection systems. First, as applications are characterized by repeating phases [5], there are opportunities to reduce the number of fault injections by studying representative phases of execution. Furthermore, random injections often do not expose relevant program behavior, as significant portions of program execution involve dead code [6] and value locality [16]. There is substantial potential to model the fault behavior of code sequences by correlating the results of fault injections to dynamic code regions. In this way, a fault analysis modeling framework can be constructed to anticipate the fault susceptibility of an application based on the execution profile of code regions, without requiring fault injections.

1.1 Contributions

This thesis presents an experimental study of current architecture-level injection techniques for evaluating fault tolerance. Based on this study, the first half of the thesis motivates the need for and design of a new fault injection infrastructure. The framework is based on trace logging and performing all possible architecture faults of a trace through an injection emulation system (TEFIS). An analysis of the approach, in order to improve the accuracy and timeliness of the system, is examined. The second half focuses on correlating the properties of a program with inherent fault tolerance characteristics. Based on this correlation, an experimental model is constructed to predict the fault tolerance of a program without any injections. This model provides a baseline for making fault tolerance predictions of applications. These points are encompassed by the following contributions of this thesis:

(1) Development of a fast and accurate fault modeling framework: A new methodology for evaluating the transient fault tolerance of program regions is presented. The framework deploys replay execution of selected program traces that provides the same accurate results as full-scale fault injection in a fraction of the experimental evaluation time.

(2) Demonstration of correlation between fault tolerance and program structure: The fundamental code property of the dynamic dependence graph of architecture state is analyzed to expose patterns exhibiting various levels of fault susceptibility. Case studies are examined to reveal relations between source code structures and fault tolerance behavior.

(3) Fault tolerance prediction based on studied program behavior: Estimation of the fault tolerance of an application by constructing an analytical model that accurately assigns fault tolerance to code regions.

The following sections elaborate on each of these contributions. The thesis concludes by proposing possible enhancements to the framework and the scope for future work with this system.

Chapter 2

Background

2.1 Single Event Phenomenon

Radiation effects on processors are a major concern for architects as transistor features shrink. Among these effects, bit flips resulting from ionization by neutron strikes from cosmic rays and by alpha particles are considered a critical source of upsets owing to their random occurrence. These effects, called Single Event Upsets (SEUs), constitute a serious threat to the reliability of digital equipment built on advanced circuits. Single event phenomena can be classified into three effects:

I Single event upset (soft error)
II Single event latch-up (soft or hard error)
III Single event burnout (hard failure)

A single event upset is defined by NASA as a radiation-induced error in microelectronic circuits caused when charged particles lose energy by ionizing the medium through which they pass, leaving behind a wake of electron-hole pairs [2]. These electron-hole pairs generate charge as they recombine, and if this charge is greater than the critical charge of the device, it results in a change of state. Transient faults fall under this category of SEUs. Figure 2.1 is an example of a neutron strike on a transistor.
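The critical-charge criterion above can be stated in a line of code. A minimal sketch follows; the charge values are illustrative placeholders, not data from [2]:

```python
# An upset occurs only when the charge generated by the particle strike
# exceeds the device's critical charge Qcrit (both in arbitrary charge units).
def causes_upset(q_generated, q_crit):
    return q_generated > q_crit

# Smaller feature sizes lower Qcrit, so the same strike flips more devices.
assert causes_upset(q_generated=30.0, q_crit=15.0)      # state changes
assert not causes_upset(q_generated=30.0, q_crit=45.0)  # charge absorbed safely
```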

Figure 2.1: A neutron strike.

Transient faults are emerging as a major concern for architects designing reliable computer systems [3, 17]. Trends in silicon process technology paint a bleak picture of future fault susceptibility. While the future error rate of a single transistor is expected to stay relatively constant [11, 13], the number of transistors per chip continues to increase at an exponential rate. As a result, the overall error rate for processors is expected to increase dramatically, making fault tolerance as important a design characteristic as performance, power consumption, and heat dissipation. Figure 2.2 shows the effect of the reduction in feature size in the years to come. The degradation rate is about 8% per bit per generation [4] and follows the curve shown in the figure. Hardware designs can be customized for fault-tolerant execution with redundant resources such as latches or extended pipelines. Providing fault tolerance may require the addition of hundreds of thousands of delay latches and 20-30% more logic in an existing processor [28]. Other, more specialized approaches create even more sophisticated systems requiring both hardware and software integration [1, 31]. While these approaches work well in their specific scientific computing domain, the general-purpose design field must adapt to the need for fault tolerance in fundamentally different ways. As design

cycle time is critical, many chip designers propose implementing redundancy-based fault tolerance using existing multi-core and multi-threaded processor extensions [9, 18]. The driving motivation is to extend the engineering decision toward multi-context processors to provide fault tolerance.

Figure 2.2: Feature size vs. soft error rate.

2.2 Architecture-level Fault Analysis

Most recent architecture research is focused on using performance models to provide Architecture Vulnerability Factor (AVF) estimates of processor reliability rather than deploying detailed fault injection into hardware RTL models. AVF is defined as the probability that a fault in a particular structure will result in an error in the final output of a program [19]. A structure's error rate is the product of its raw error rate, as determined by process and circuit technology, and its AVF. Processor designers can use AVF analysis to determine the processor structures probabilistically in need of protection (e.g., structures with high AVF are likely to be protected). Some structures, such as the branch predictor, have no effect on whether an error will propagate to the output of the program. In contrast, other structures are on the opposite end of the spectrum, such

as the instruction issue window, load-store queue, and re-order buffer. The majority of hardware structures fall between the two extremes. While AVF analysis provides support for investigating new fault-tolerant architecture techniques, program execution characteristics are largely missing from the determination of periods of software error susceptibility. A software-centric view makes this key insight: although faults occur at the hardware level, the only faults that matter are the faults that affect software correctness. By changing the boundary of output comparison to software, a software-centric model shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, only faults that affect correctness are detected; benign faults are safely ignored. A software-centric system with both detection and recovery will not need to invoke the recovery mechanism for faults that do not affect correctness. The primary problem with AVF is that software periods of vulnerability differ substantially from micro-architecture periods of vulnerability. As research trends dictate finding ways to selectively enable transient fault tolerance mechanisms, run-time and off-line experimental techniques must be guided equally by program behavior and hardware. As such, it is important to determine and predict when program susceptibility and hardware susceptibility differ.
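The AVF relation above can be sketched in a few lines. This is a minimal illustration of the definition (effective rate = raw rate x AVF); the FIT numbers and AVF values are illustrative placeholders, not measurements from [19]:

```python
# A structure's effective error rate is its raw, circuit-determined rate
# scaled by its architectural vulnerability factor (AVF).
def effective_fit(raw_fit, avf):
    """Effective error rate (FIT) = raw rate x AVF."""
    return raw_fit * avf

structures = {
    "branch_predictor": (100.0, 0.0),  # faults here never reach program output
    "issue_window":     (100.0, 0.4),
    "reorder_buffer":   (100.0, 0.6),
}
# Structures with high AVF are the probabilistic candidates for protection.
chip_fit = sum(effective_fit(raw, avf) for raw, avf in structures.values())
```

Under this toy model, the high-AVF re-order buffer dominates the chip-level rate while the branch predictor contributes nothing, matching the spectrum described above.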

Chapter 3

Motivation

3.1 Overview of current injection methodologies

A major problem in the development of fault-tolerant systems is the accurate determination of the dependability properties of the system. Unlike performance, which can be evaluated through the use of benchmark programs, the degree of fault tolerance and reliability of a system cannot be evaluated in such a manner, because we rarely have the luxury of allowing systems to run for a very long time to observe their behavior under fault effects. The generally preferred solution to this problem is to inject the effects of faults into a simulation model or a prototype implementation, and to observe the behavior of the system under the injected faults. Fault injection in a simulation is very flexible but far too time consuming. On the other hand, it is much more difficult to inject accurate (i.e., realistic) faults into a prototype, though their effect is readily observable. Figure 3.1 shows current methods of fault injection, where a random injection procedure has been adopted; the graph shows the number of injections per benchmark used in recent papers that adopted a random fault injection methodology. Table 3.1 lists the references for each of the published articles numbered in Figure 3.1.

Figure 3.1: Current random injection methodologies.

Number  Short Description                                                Reference
(1)     Soft-error detection through software TFT techniques (1999)      [22]
(2)     Y-Branches (2003)                                                [29]
(3)     Characterizing TF effects on processor pipeline (2004)           [30]
(4)     Configurable TF detection via dynamic binary translation (2006)  [24]
(5)     SymPLFIED (2008)                                                 [20]
(6)     Using PLR to exploit multi-cores for TFT (2007)                  [27]

Table 3.1: References.

A recent paper focusing on instruction-level error derating adopted an interval injection methodology [7]. The injection campaign includes 100 uniformly distributed points of injection in trace lengths of 100 instructions, in 32-bit as well as 64-bit registers, resulting in up to 224 injections per instruction. This method does show a representative set of experiments without having to simulate the entire benchmark.

3.2 Fault penetration and point of injection

Fault injections can be performed at geometric-layout, circuit, gate, or block level models. The block level model is a functional view defining the data and control paths of the application. Logic gates can go through several levels of masking, such as electrical,

latching window, and logical masking, before they can affect the behavior of an application [26]. From a user's point of view, it only matters whether a transient fault causes undesirable effects in the application. Hence our study can utilize fault injections at the block level as long as they emulate the same effect as the propagation of a hardware fault. The process of injecting faults into the architectural registers captures this notion, as shown in Figure 3.2, although logical masking still persists between the architectural and application layers.

Figure 3.2: Fault penetration.

Current injection techniques corrupt a single bit at 1000 random execution points [25, 27]. As discussed previously, such tests do not regard program behavior and have substantial variation. Figure 3.3 demonstrates the cumulative time for 1000 random fault injections. The injection campaign time is sorted from the longest-running injections to the shortest. Some of the runs take only a matter of seconds, while others take several minutes to complete before it can be determined whether the program's behavior was changed. Based on Figure 3.3, performing statistically significant fault injection using random or interval-based schemes would require substantial computational effort. Clearly, fault injections into program state must be strategically guided to gather fault outcomes for only certain regions of interest.

Figure 3.3: Fault injection distribution in time.

3.3 Natural fault resilience and ineffective injections

As stated above, though electrical and latching-window masking is accounted for by injecting faults at the architectural level, logical masking of faults is overlooked by random injection methods. Examples of logical masking effects include logical operations, conditional operations, overwriting faulted values before use, binary return values, dynamically dead instructions [6], and silent stores [15]. This is why a large number of random injections do not expose the fault susceptibility of a program. For example, Table 3.2 shows the probable levels of tolerance for logical operations at the bit level. The table compares a logical operation of a register either with itself or with another operand; the entries follow from the truth tables of the operations. From the table it can be seen that an XOR operation of a register on itself has 100% fault tolerance, since it simply clears the register. Hence any bit perturbation before this operation would have no effect whatsoever.

Operation            Other Operand   Itself
AND, OR, NAND, NOR   50%             0%
XOR, XNOR            0%              100%
NOT                  0%              -

Table 3.2: Fault tolerance characteristics of logical operations.

Figure 3.4 is an example of logical masking obtained from a trace of 186.crafty. The figure shows the source code of a function from the benchmark along with its dynamic block execution. It can be seen from the source code that all this function does is return a binary value (true or false) based on the movement of a white or black piece in a game of chess. The analysis shows that with such a dependence flow, the effect of flipping bits would hardly impact the return values.

Random methods of fault injection impose a limitation on the number of injections per unit time, a high percentage of which may expose little susceptibility, as in the above example. These limitations motivate the development of a system that is not only orders of magnitude faster but also as accurate as full injections, and that can capture the properties of the program that affect its fault susceptibility. TEFIS (Trace Emulation Fault Injecting System) is a technique for emulating hardware faults using software. It uses PIN, a binary instrumentation tool, for tracing and emulation, which form the two major portions of the framework. Traces are captured along with their state information from a running executable based on various user-defined parameters. This is followed by a rigorous fault injection procedure and replay execution of the faulted trace, but only up to the length of the trace. This approach provides the desired flexibility and, at the same time, allows many experimental runs in a relatively short period of time.
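The self-operand entries of Table 3.2 can be checked mechanically. The sketch below is a minimal illustration in Python; the 32-bit register model is an assumption for the illustration, not part of any injection tooling:

```python
# Flip one bit of a 32-bit value to emulate a single-event upset.
def flip(value, bit):
    return (value ^ (1 << bit)) & 0xFFFFFFFF

r = 0xDEADBEEF
for bit in range(32):
    f = flip(r, bit)
    # XOR of a register with itself clears it, so the fault is fully masked:
    assert f ^ f == r ^ r == 0
    # AND of a register with itself preserves it, so the fault always shows:
    assert f & f != r & r

# With another operand, AND masks a flipped bit whenever the other operand's
# bit is 0 -- half of all uniformly random operand bits (the 50% table entry).
other = 0b1010
assert (flip(0, 0) & other) == (0 & other)  # bit 0 of other is 0: masked
assert (flip(0, 1) & other) != (0 & other)  # bit 1 of other is 1: exposed
```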

(a) Source code    (b) Dynamic control flow

Figure 3.4: A highly fault-tolerant case from 186.crafty with 80.6% correct results of emulation.

Chapter 4

Fault Emulation Framework and Results

4.1 Overview and Goals

Software-implemented fault injection methodologies can be broadly classified into compile-time and run-time injections. A run-time fault injection system truly emulates the effect of a real transient fault occurrence. Current methods of run-time software fault injection are either time based or interrupt driven. These are program-level fault injection techniques that are generally unguided: faults are injected randomly, at intervals, or within phases of the application. Phase-based injection exploits some of the properties of an application. In any case, these methods suffer from the following drawbacks:

Experimental time for a single injection: This time consists of two parts, the build-up time and the monitoring time. The build-up time is the time required for full execution up until the fault point, while the monitoring time is what follows the injection until an outcome is reached.

Multiplied impact of the single-injection overhead: Study of the complete application is limited by the single-injection overhead multiplied over the number of injections applied to the application. For evaluating

the fault tolerance of the application, a statistically significant number of injections is necessary, which consumes a lot of time and limits the exposure of the application to a small number of faults.

Inaccuracy in determining the dependability properties of a system: Due to the combined effect of the above two points, the limited number of injections provides little certainty in understanding the dependability behavior of the system under all kinds of faults.

Keeping these limitations in mind, there is a need for a system that overcomes the requirement of running a full execution of the application for each injection, and that can explore the dependability properties of a system that affect its fault susceptibility. The trace emulation fault injecting system (TEFIS) has been developed with the following goals in mind:

(1) Accuracy in fault tolerance with fractional execution: The framework employs fractional traces for execution within the binary. Traces of any length can be obtained at any time point within the application. Accuracy is determined by comparing the results of this fractionally injected execution to those of full injections. Figure 4.1 shows a comparison between full injection and fractional emulation on trace lengths of 100 and 200 instructions, where the dark regions show the fractional length of execution.

(2) Very high speed injections and summary generation: The length of a single run extends from the point of injection to the end of the trace. This eliminates the build-up time before the point of injection and the monitoring time after the end of the trace, limiting the execution time to only the number of instructions per trace. This method not only speeds up result collection but also provides room for a large number of emulations.

Figure 4.1: Fractional emulation captures the effect of fractional execution with traces against full injections in the entire application. The figure shows the exclusion of build-up and monitoring time.

(3) Expose deeper understanding of program behavior: Program properties have been studied based on their possible impact on the fault susceptibility of the application. Two methods of correlation have been described and used in this analysis: code region similarity and data dependence similarity.

4.2 The TEFIS Framework

The framework has been designed to have functionally independent units, presented here in the order in which they are executed within the framework. The three units of this procedure are execution tracing, fault generation, and fault emulation. The tracing and emulation system uses PIN [21], a binary instrumentation tool developed at Intel. PIN is used in this framework for generating and loading context information at the instruction level.

4.2.1 Execution Tracing

Execution tracing is a one-time procedure involving the generation of traces from the binary, with a number of controllable parameters of operation. The framework provides operational flexibility on tracing methods with the following options:

Length - Traces of any length can be generated
Time - Any point in time
Interval - Uniform intervals with uniform lengths
Phase - Integration with SIMPOINTS [5]
Function - Function-specific tracing

The system uses a tool called ExecutionTracer for this purpose. The ExecutionTracer is a PIN [21] tool capable of dumping snippets of the binary at run time, with predefined knobs for the user to control the length and starting point of the trace. Once these parameters are passed in, every instruction is instrumented before its execution. This instrumentation gathers the following information:

Trace context: Disassembly with the instruction pointer, which gives control flow information
Register context: The value stored within each of the 32-bit registers
Memory context: Any values read from or written to memory, along with their addresses
Dependence graphs: Data flow dependence information generated in graphical format with the help of dot [12]

Edge context: Edge information at each basic block (not currently used)

An example format for context dumps within a trace is shown in Figure 4.2. The figure shows a register context, which is generated for each instruction; a memory context, if one exists for that instruction; and the disassembly for that instruction, in the form of a trace context with its instruction pointer information. The convention for storing each context is the context name followed by the trace number. For trace generation these files form static storage, but for fault emulations they are created and destroyed on the fly as required.

Figure 4.2: Context dump.

4.2.2 Fault Generation

Fault generation follows the one-time tracing procedure. It is a highly iterative process that repeats a number of times per trace. The bit flips required for fault emulation are generated in this step. The tool used for bit flips unfolds through a number of stages, listed sequentially as follows:

Trace - A trace generated from the binary forms the top-level entry
Instruction - Each instruction within a trace forms the next level of entry

Source registers - Every 32-bit source register of an instruction, if present, forms the next level of entry
Bits - Every bit in the 32-bit source register is flipped, each flip forming a single emulation

Figure 4.3 gives an overview of the tracing and fault generation procedure. The right half of the figure depicts the one-time tracing system, while the left half shows the fault generation and emulation steps. The box in the middle shows a possible future step of filtering less susceptible instructions. The right half has three columns representing the binary, a trace, and the context. The binary is shown split into traces, with each trace having a number of instructions and each instruction having a context associated with it. The fault generation portion is shown for a single trace. The figure shows the number of emulations for each instruction in the trace; within each instruction, 32 flips per source register are shown for each emulation. This structure repeats over all the traces within the executable. Faulted contexts are saved for each emulation in a flat directory structure, with extensions to each file name representing the trace and emulation it belongs to. This step prepares the faulted contexts for their actual runs during fault emulation.

4.2.3 Fault Emulation

This step runs the binary loaded with the faulted context. Each run consists of a context with a single bit flipped within a source register of an instruction within the trace. Though there are a large number of runs of the executable, their execution length is limited to the end of the trace to which the emulation belongs. Faulted register contexts from the fault generation stage, along with the memory context dumped during tracing, are loaded at run time using instrumentation via PIN [21].
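The fault-generation hierarchy and the bounded replay described above can be sketched together as a toy model. The mini-ISA, field names, and single-register machine below are illustrative assumptions; TEFIS itself drives a real binary through PIN:

```python
# Fault generation: every bit of every 32-bit source register of every
# instruction in the trace yields one candidate emulation.
def enumerate_faults(trace):
    for i, insn in enumerate(trace):
        for reg in insn["src_regs"]:
            for bit in range(32):
                yield (i, reg, bit)

# Fault emulation: load the dumped register context at the injection point,
# flip one bit, and replay only to the end of the trace.
def emulate(trace, step, inject_at, reg, bit):
    regs = dict(trace[inject_at]["registers"])
    regs[reg] ^= (1 << bit)            # the injected single-bit fault
    for insn in trace[inject_at:]:
        regs = step(insn, regs)
    return regs

# Toy single-register ISA used only for this sketch.
def step(insn, regs):
    if insn["op"] == "inc":
        regs = dict(regs, eax=(regs["eax"] + 1) & 0xFFFFFFFF)
    elif insn["op"] == "xor_self":     # clears eax: logically masks prior faults
        regs = dict(regs, eax=0)
    return regs

trace = [
    {"op": "inc", "src_regs": ["eax"], "registers": {"eax": 7}},
    {"op": "inc", "src_regs": ["eax"], "registers": {"eax": 8}},
]
assert len(list(enumerate_faults(trace))) == 64   # 2 insns * 1 reg * 32 bits
# The unfaulted replay from instruction 0 ends with eax == 9; a bit-0 flip at
# the same point propagates to a different final context (data-flow deviation).
assert emulate(trace, step, 0, "eax", 0) == {"eax": 8}
```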

Figure 4.3: Framework overview.

Since the instruction pointer is known from the trace, the exact dumped contexts for an instruction are loaded at the exact same point, but with one of the source register bits flipped. After this initial loading, the binary is run from this state for the same number of instructions that follow it in the original trace. This process is repeated over all emulations up to half of the trace length, i.e., the first 50% of the instructions within the trace. The framework lets each trace represent a folder, with all the emulations for the trace residing in that folder.

Figure 4.4 shows the fault emulation system, with the reference trace on the left and the emulation trace on the right. The reference trace is shown for the purpose of result comparison. The figure shows a point of injection, which consists of loading the memory context and the faulted register context based on the context gathered for the same instruction in the reference trace. When the binary is run with the state set in such a manner, there are five different possibilities in execution:

Data flow deviation: This results in a context mismatch, with or without a control flow deviation.

Control flow deviation: Instruction pointers show a mismatch in the traces,

generally accompanied by a data flow deviation, as expected, since a totally new set of instructions is encountered.

Signal fault: Trace execution aborts due to an inappropriate memory access, signaled as an error by the operating system.

Timeout: Cases where the trace gets stuck in a loop or hangs, in which case execution is forced to halt by a timer. These cases are generally very rare due to the design of the system.

None: The emulation exactly matches the reference trace in all contexts.

Figure 4.4: Fault emulation.

The results are based on comparing the contexts of the reference trace with those of the emulated trace at the end of execution of the last instruction. The results are placed in bins indicating mismatches, signal faults, or correct execution.

4.3 Trace Emulation Experimental Results and Analysis

Twelve sets of integer benchmarks from the SPEC2000 suite were used as candidate applications for the fault emulation experiments. Full injections were also performed on

the same set of traces that were used for emulation. A set of traces was generated per benchmark for the experiment. Trace lengths of 100 and 200 instructions were selected, and emulations were run across half the length of each trace. Since the 100- and 200-instruction traces share the same starting point, each 200-instruction trace covers the characteristics of the corresponding shorter trace and more.

Fault Tolerance Categories

Each emulation was compared with its reference trace on the fly, and results were generated. The grouping of emulation results can be summarized as follows:

Mismatch: The register contexts, the memory contexts, or the control flow mismatch.

Signal Fault: The emulation tool is equipped with handlers to catch any signal faults raised by the operating system.

None: All contexts and the control flow match at the end of the trace.

Figure 4.5 shows the fault tolerance characteristics of each benchmark, based on the overall result categories above. The groupings in the left figure are made using cut-off percentages: "Mostly incorrect" signifies over 50% incorrect results with fewer than 10% correct entries in that trace; the segmentation fault case is analogous; and the "Correct" bin accumulates all traces with over 10% correct entries. The figure displays the overall fault tolerance behavior of each benchmark; it can be seen that 186.crafty and 256.perlbmk are highly fault tolerant.

Figure 4.5: Result categories.

Accuracy of Emulations

As described previously, the goals of the framework are to demonstrate two contributions: the accuracy of the framework and its speed of execution, both measured against full injections. For this purpose, the same traces used for emulation were taken, and full injections were performed on each bit of every source register in every instruction of the trace. Since full injections run the complete binary and base their results on the entire execution, the emulation results are compared against them to determine accuracy.

Figure 4.6 shows the accuracy of emulations with two trace lengths against full injections. The figure shows three vertical bars per benchmark: the first two are emulations with trace lengths of 100 and 200 instructions respectively, while the last bar shows full injections. The categories are in the same order as mentioned before, and the Y axis shows the percentage contribution per category. The accuracy of emulations is determined by how closely the correct entries for the 100- and 200-instruction emulations track the full injections. The figure shows that accuracy is fairly good for both trace lengths: though application dependent, the real injection data lies fairly close to the emulation results observed with the above

trace lengths. An enhancement to this work would examine trace lengths increasing beyond 200 instructions. An experiment was conducted to monitor the position of the control flow deviation caused by fault injections; it was observed that a deviation generally occurs within 100 to 200 basic blocks, which gives an approximate length of 1000 instructions assuming an average of 5 instructions per basic block. A saturating curve for correctness should therefore be observed as trace lengths approach this value. A follow-up analysis would check whether the trend of correct entries approaches the full-injection results as trace length grows.

Figure 4.6: Emulation accuracy.

Average Execution Time of Fault Emulations Against Full Injections

Figure 4.7 compares the execution times of 1000 emulations against the same number of full injections. Notice that the Y axis is on a logarithmic scale. It can be seen that the execution times for emulations are much smaller than those of the full

injections. Moreover, the emulation execution time remains roughly constant across all benchmarks, since the number of instructions executed remains more or less the same.

Figure 4.7: A comparison of execution time.

These two results are the most important, as they establish both the accuracy of the emulation system and its ability to achieve execution times orders of magnitude lower than those of full injections. The system also has the flexibility to add an extra stage for ineffective-instruction filtering, which would further improve its ability to capture the fault susceptibility of an application while reducing execution times.

Chapter 5

Analysis of Fault Tolerance Program Behavior

The previous chapter demonstrated the accuracy and reduced execution time of the emulation framework. This chapter correlates the fault tolerance observed with the framework to program behavior, which is necessary for understanding the relation between the properties of a running program and its fault susceptibility. The criterion for selecting such properties is the assumption that similar code execution should exhibit similar fault susceptibility.

5.1 Source Code Analysis

Finding the properties of a program that affect its fault tolerance requires a deeper understanding of what happens within the source. An application's fault susceptibility may be analyzed either at the algorithm level or through its dynamic assembly. This section examines algorithmic effects on fault tolerance and performs a source code analysis to understand program behavior from a reliability point of view.

Algorithm Level Effects

To observe an algorithm-level effect on fault tolerance, fault injections were applied to sorting routines that perform the same function on the same inputs

but with different algorithms. These applications were also compiled under different optimization settings to observe the effect of optimizations on each. A portion of the results is displayed in Figure 5.1 to show the effect of the algorithm on the fault tolerance of an application.

Figure 5.1: Fault tolerance analysis of sorting routines: (a) heapsort, (b) quicksort.

Dynamic Source Code Behavior

A PIN [21] tool was written to track the dynamic flow of instructions through basic blocks along with their disassembly. The tool takes a function name and dumps a static control flow graph annotated with markings for its dynamic execution. The function names were obtained from the functions occurring in the traces, which were dumped by a script built on another simple PIN [21] tool. This gives an idea of what the trace is actually executing and a complete view of how a fault propagates. Sections of these traces were processed through this tool to analyze which instructions a fault encounters as it flows dynamically through the program.

A fault tolerant trace has already been illustrated in the motivation section. This section presents further cases: traces with low fault tolerance and traces with high segmentation fault rates.

Figure 5.2 shows a case of low fault tolerance, about 91.4% incorrect behavior, in 164.gzip. The trace covers a function called updcrc(), which performs a CRC check and is hence compute intensive:

    /* Source function */
    ulg updcrc(s, n)
        uch *s;       /* pointer to bytes to pump through */
        unsigned n;   /* number of bytes in s[] */
    {
        register ulg c;                      /* temporary variable */
        static ulg crc = (ulg)0xffffffffL;   /* shift register contents */

        if (s == NULL) {
            c = 0xffffffffL;
        } else {
            c = crc;
            if (n) do {
                c = crc_32_tab[((int)c ^ (*s++)) & 0xff] ^ (c >> 8);
            } while (--n);
        }
        crc = c;
        return c ^ 0xffffffffL;  /* (instead of ~c for 64-bit machines) */
    }

Figure 5.2: A trace from 164.gzip with 91.4% incorrect results of emulation: (a) source code, (b) dynamic control flow.

Compared to the highly fault tolerant case, this layout has no binary decisions or masking logic to squash corrupted values. Instead, the operation just before the function returns XORs the result with an all-ones constant, which necessarily flips every bit; from the context point of view, this value will never match after a fault injection.

Figure 5.3 shows a case with a high segmentation fault rate, here a function from 300.twolf. The source code shows that this function traverses a linked list and assigns values within the traversal. Since most of the assignments go through pointers, the function is a very likely candidate for invalid or misaligned memory accesses.

The study above calls for properties closer to the source, selected by their impact on fault tolerance. Two such properties and their effects on the fault susceptibility of a program have been chosen: the following sections analyze code region similarity and the data dependencies of a program as tools for correlating its fault tolerance.

5.2 Dynamic Program Trace Behavior

This section analyzes program behavior based on the similarity of code regions and its effect on the fault tolerance of the application. The analysis compares the existing traces with each other, looking for matches in program counter values. A match indicates the program executing in the same code region, though probably in a different dynamic state.

Figure 5.3: A trace from 300.twolf with 100% segmentation fault results of emulation: (a) source code, (b) dynamic control flow.

The correlation with fault tolerance is based on generating a similarity score for each pair of traces, indicating how similar the two traces are. The score generation procedure takes two traces, finds the percentage of matching program counters from one trace to the other and vice versa, and averages the two values into a similarity score. This procedure is iterated over all pairs of traces. Figure 5.4 shows the overall correlation of fault tolerance to PC similarity across all benchmarks. The different bins represent the delta

in fault tolerance for trace pairs with similarity scores above 65%.

Figure 5.4: Fault tolerance correlation with program counter similarity.

To observe the correlation, the difference in fault tolerance between pairs of traces with similarity scores of 65% and above is plotted in Figure 5.5. The figure plots the program counter similarity score against a function of the fault tolerance delta between each pair of traces, with a third-degree polynomial fitting curve depicting the trend of fault susceptibility against code region similarity. Two trends are observable: 175.vpr and 181.mcf show a very promising decrease in the fault tolerance delta with an increasing similarity score, while the compression benchmarks 164.gzip and 256.bzip2 show very irregular behavior. This irregularity could be due to large variations in dynamic state: similar instructions are executed, but at different times. These applications also tend to have more inter-procedural calls than the others.

Figure 5.5: Fault susceptibility correlation calculated using similar program code points: (a) 175.vpr, (b) 256.bzip2, (c) 181.mcf, (d) 164.gzip.

Inter-Procedural Fault Tolerance

The existing trace information was used to correlate the number of procedures in a trace with its fault susceptibility. As expected, fault tolerance decreases with an increasing number of inter-procedural calls within a trace. Figure 5.6 shows the average fault tolerance across all traces as the number of procedures per trace increases. The figure also shows how often each procedure count occurs among a set of 1000 traces, scaled by a factor of 0.2 for visibility in the same region. Traces containing a single function dominate the rest, while the fewer traces with higher procedure counts show the highest fault susceptibility.

This study indicates that though certain applications show a favorable trend of fault tolerance with code similarity, others depend on the dynamic state of the system. This is taken into account in the following section, where data

dependencies, the primary propagators of a fault, are analyzed and compared among traces.

Figure 5.6: Inter-procedural fault tolerance.

5.3 Dynamic Dependence Graph Representation

Program counter correlation captures similarity in terms of functional execution, but fails to capture the dynamic data flow that propagates or masks the effect of a transient fault. The propagation of a fault depends on the data dependencies of a program, and this information needs to be captured and used to find similar trends that correlate with the fault susceptibility of an application. Figure 5.7 shows an example of a data dependency graph in dot format; the reliability of the system is inherent in the connectivity of the graph.

For this purpose, data dependency graphs are generated from the traces by a graph clustering tool developed by Dennis Shasha at NYU [8]. The tool takes a data set of directed graphs generated by a PIN [21] tool which instruments the program and dumps data dependency information. The graph clustering tool uses SUBDUE [14] to find common sub-structures in a given trace. The tool then iterates

over all traces to find similar structures and clusters them together. The tightness measure of a cluster, which defines the precision of the similarity, can also be specified.

Figure 5.7: Example dependency graph.

This graph clustering procedure was iterated over all the generated traces, and a similarity score was calculated for each pair of traces based on the closeness of the dependence graphs they contain. This analysis directly correlates fault tolerance with the dynamic dependencies of the program, which are the active ingredient in fault propagation. For a larger view, similarity scores of 35% and above have been plotted against their respective fault tolerances; in each case the graph on the left gives the actual raw numbers while that on the right is a polynomial fitting of the graph on the left. The cases in Figure 5.8 are the same as those shown for program counter similarity, yet they show a decreasing trend in fault tolerance with an increasing dependence similarity score. A similar trend was observed among all the other benchmarks, with very few outliers.

Figure 5.8: Dependence graph similarity correlation: (a) 164.gzip, (b) 164.gzip - trend, (c) 175.vpr, (d) 175.vpr - trend.

Figure 5.9 shows the dependence similarity correlation over all the benchmarks. From the figure, it can be observed that the trend remains similar when averaged over all the benchmarks: the delta in fault tolerance decreases with increasing similarity, although a slight rise in the curve can be observed at the end.

Figure 5.9: Dependence graph similarity correlation - all benchmarks.

5.4 Estimating Program Fault Tolerance

Based on the two correlation techniques of the previous sections, it is evident that dependency graph similarity correlates much more accurately with fault tolerance. This technique is therefore put to use in predicting fault tolerance from the consistent trends of fault susceptibility seen against graph similarity scores. If the prediction accuracy falls within an acceptable range, one could look only at the dependence structure of a program and predict its fault tolerance without performing any injections.

The existing results for the traces from the emulation framework, along with their dependency graphs, were used to build prediction models. The prediction model uses an increasing number of the graphs most similar to the reference graph in order to predict its fault tolerance, averaging the fault tolerance values whenever more than one graph is used. Ten models have been selected: the first model predicts the fault tolerance of a trace from the fault tolerance of the single trace with the most similar dependency graph, while the higher-numbered models select the designated number of most similar graphs and average their fault tolerance scores for the prediction. The accuracy of the prediction model is observed as the number of similarity graphs in use increases.

Figure 5.10 gives the accuracy of each prediction model averaged over all the traces of the 12 benchmarks. The X axis represents the model by the number of similarity graphs used for its prediction, and the Y axis represents the average accuracy of prediction. Since the Y axis is the difference between the actual and predicted fault tolerance averaged over all traces, a value closer to the X axis represents higher prediction accuracy. Based on the figure, it is evident that the accuracy in prediction of fault tolerance


More information

INTRODUCING ABSTRACTION TO VULNERABILITY ANALYSIS

INTRODUCING ABSTRACTION TO VULNERABILITY ANALYSIS INTRODUCING ABSTRACTION TO VULNERABILITY ANALYSIS A Dissertation Presented by Vilas Keshav Sridharan to The Department of Electrical and Computer Engineering in partial fulfillment of the requirements

More information

Deviceless respiratory motion correction in PET imaging exploring the potential of novel data driven strategies

Deviceless respiratory motion correction in PET imaging exploring the potential of novel data driven strategies g Deviceless respiratory motion correction in PET imaging exploring the potential of novel data driven strategies Presented by Adam Kesner, Ph.D., DABR Assistant Professor, Division of Radiological Sciences,

More information

Q.1 Explain Computer s Basic Elements

Q.1 Explain Computer s Basic Elements Q.1 Explain Computer s Basic Elements Ans. At a top level, a computer consists of processor, memory, and I/O components, with one or more modules of each type. These components are interconnected in some

More information

Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers

Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers Subash Chandar G (g-chandar1@ti.com), Vaideeswaran S (vaidee@ti.com) DSP Design, Texas Instruments India

More information

William Stallings Computer Organization and Architecture 8th Edition. Chapter 5 Internal Memory

William Stallings Computer Organization and Architecture 8th Edition. Chapter 5 Internal Memory William Stallings Computer Organization and Architecture 8th Edition Chapter 5 Internal Memory Semiconductor Memory The basic element of a semiconductor memory is the memory cell. Although a variety of

More information

Chapter I INTRODUCTION. and potential, previous deployments and engineering issues that concern them, and the security

Chapter I INTRODUCTION. and potential, previous deployments and engineering issues that concern them, and the security Chapter I INTRODUCTION This thesis provides an introduction to wireless sensor network [47-51], their history and potential, previous deployments and engineering issues that concern them, and the security

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Dynamic Branch Prediction

Dynamic Branch Prediction #1 lec # 6 Fall 2002 9-25-2002 Dynamic Branch Prediction Dynamic branch prediction schemes are different from static mechanisms because they use the run-time behavior of branches to make predictions. Usually

More information

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications; somewhat less frequently in scientific ones Unconditional branches : 20% (of

More information

Bits, Words, and Integers

Bits, Words, and Integers Computer Science 52 Bits, Words, and Integers Spring Semester, 2017 In this document, we look at how bits are organized into meaningful data. In particular, we will see the details of how integers are

More information

Basic Processing Unit: Some Fundamental Concepts, Execution of a. Complete Instruction, Multiple Bus Organization, Hard-wired Control,

Basic Processing Unit: Some Fundamental Concepts, Execution of a. Complete Instruction, Multiple Bus Organization, Hard-wired Control, UNIT - 7 Basic Processing Unit: Some Fundamental Concepts, Execution of a Complete Instruction, Multiple Bus Organization, Hard-wired Control, Microprogrammed Control Page 178 UNIT - 7 BASIC PROCESSING

More information

hot plug RAID memory technology for fault tolerance and scalability

hot plug RAID memory technology for fault tolerance and scalability hp industry standard servers april 2003 technology brief TC030412TB hot plug RAID memory technology for fault tolerance and scalability table of contents abstract... 2 introduction... 2 memory reliability...

More information

11 Data Structures Foundations of Computer Science Cengage Learning

11 Data Structures Foundations of Computer Science Cengage Learning 11 Data Structures 11.1 Foundations of Computer Science Cengage Learning Objectives After studying this chapter, the student should be able to: Define a data structure. Define an array as a data structure

More information

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra Binary Representation Computer Systems Information is represented as a sequence of binary digits: Bits What the actual bits represent depends on the context: Seminar 3 Numerical value (integer, floating

More information

11. SEU Mitigation in Stratix IV Devices

11. SEU Mitigation in Stratix IV Devices 11. SEU Mitigation in Stratix IV Devices February 2011 SIV51011-3.2 SIV51011-3.2 This chapter describes how to use the error detection cyclical redundancy check (CRC) feature when a Stratix IV device is

More information

Satisfactory Peening Intensity Curves

Satisfactory Peening Intensity Curves academic study Prof. Dr. David Kirk Coventry University, U.K. Satisfactory Peening Intensity Curves INTRODUCTION Obtaining satisfactory peening intensity curves is a basic priority. Such curves will: 1

More information

Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 14

Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 14 Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 14 Scan Converting Lines, Circles and Ellipses Hello everybody, welcome again

More information

c 2004 by Ritu Gupta. All rights reserved.

c 2004 by Ritu Gupta. All rights reserved. c by Ritu Gupta. All rights reserved. JOINT PROCESSOR-MEMORY ADAPTATION FOR ENERGY FOR GENERAL-PURPOSE APPLICATIONS BY RITU GUPTA B.Tech, Indian Institute of Technology, Bombay, THESIS Submitted in partial

More information

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015 Advanced Parallel Architecture Lesson 3 Annalisa Massini - 2014/2015 Von Neumann Architecture 2 Summary of the traditional computer architecture: Von Neumann architecture http://williamstallings.com/coa/coa7e.html

More information

Accurate Analysis of Single Event Upsets in a Pipelined Microprocessor

Accurate Analysis of Single Event Upsets in a Pipelined Microprocessor Accurate Analysis of Single Event Upsets in a Pipelined Microprocessor M. Rebaudengo, M. Sonza Reorda, M. Violante Politecnico di Torino Dipartimento di Automatica e Informatica Torino, Italy www.cad.polito.it

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Jong Wook Kwak 1, Seong Tae Jhang 2, and Chu Shik Jhon 1 1 Department of Electrical Engineering and Computer Science, Seoul

More information

A Review Paper on Reconfigurable Techniques to Improve Critical Parameters of SRAM

A Review Paper on Reconfigurable Techniques to Improve Critical Parameters of SRAM IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 09, 2016 ISSN (online): 2321-0613 A Review Paper on Reconfigurable Techniques to Improve Critical Parameters of SRAM Yogit

More information

VLSI Test Technology and Reliability (ET4076)

VLSI Test Technology and Reliability (ET4076) VLSI Test Technology and Reliability (ET4076) Lecture 4(part 2) Testability Measurements (Chapter 6) Said Hamdioui Computer Engineering Lab Delft University of Technology 2009-2010 1 Previous lecture What

More information

Evaluation of Embedded Operating System by a Software Method *

Evaluation of Embedded Operating System by a Software Method * Jan. 2006, Volume 3, No.1 (Serial No.14) Journal of Communication and Computer, ISSN1548-7709, USA * Junjie Peng 1, Jun Ma 2, Bingrong Hong 3 (1,3 School of Computer Science & Engineering, Harbin Institute

More information

Intelligent Networks For Fault Tolerance in Real-Time Distributed Systems. Jayakrishnan Nair

Intelligent Networks For Fault Tolerance in Real-Time Distributed Systems. Jayakrishnan Nair Intelligent Networks For Fault Tolerance in Real-Time Distributed Systems Jayakrishnan Nair Real Time Distributed Systems A Distributed System may follow a traditional Master-Slave Approach for Task Allocation

More information

Very Large Scale Integration (VLSI)

Very Large Scale Integration (VLSI) Very Large Scale Integration (VLSI) Lecture 10 Dr. Ahmed H. Madian Ah_madian@hotmail.com Dr. Ahmed H. Madian-VLSI 1 Content Manufacturing Defects Wafer defects Chip defects Board defects system defects

More information

Chapter 7 The Potential of Special-Purpose Hardware

Chapter 7 The Potential of Special-Purpose Hardware Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture

More information

RTL Power Estimation and Optimization

RTL Power Estimation and Optimization Power Modeling Issues RTL Power Estimation and Optimization Model granularity Model parameters Model semantics Model storage Model construction Politecnico di Torino Dip. di Automatica e Informatica RTL

More information

Milind Kulkarni Research Statement

Milind Kulkarni Research Statement Milind Kulkarni Research Statement With the increasing ubiquity of multicore processors, interest in parallel programming is again on the upswing. Over the past three decades, languages and compilers researchers

More information

Optimal Clustering and Statistical Identification of Defective ICs using I DDQ Testing

Optimal Clustering and Statistical Identification of Defective ICs using I DDQ Testing Optimal Clustering and Statistical Identification of Defective ICs using I DDQ Testing A. Rao +, A.P. Jayasumana * and Y.K. Malaiya* *Colorado State University, Fort Collins, CO 8523 + PalmChip Corporation,

More information

A Fault Tolerant Superscalar Processor

A Fault Tolerant Superscalar Processor A Fault Tolerant Superscalar Processor 1 [Based on Coverage of a Microarchitecture-level Fault Check Regimen in a Superscalar Processor by V. Reddy and E. Rotenberg (2008)] P R E S E N T E D B Y NAN Z

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup Chapter 4 Routers with Tiny Buffers: Experiments This chapter describes two sets of experiments with tiny buffers in networks: one in a testbed and the other in a real network over the Internet2 1 backbone.

More information

DATABASE SCALABILITY AND CLUSTERING

DATABASE SCALABILITY AND CLUSTERING WHITE PAPER DATABASE SCALABILITY AND CLUSTERING As application architectures become increasingly dependent on distributed communication and processing, it is extremely important to understand where the

More information

High Speed Fault Injection Tool (FITO) Implemented With VHDL on FPGA For Testing Fault Tolerant Designs

High Speed Fault Injection Tool (FITO) Implemented With VHDL on FPGA For Testing Fault Tolerant Designs Vol. 3, Issue. 5, Sep - Oct. 2013 pp-2894-2900 ISSN: 2249-6645 High Speed Fault Injection Tool (FITO) Implemented With VHDL on FPGA For Testing Fault Tolerant Designs M. Reddy Sekhar Reddy, R.Sudheer Babu

More information

Analysis of Different Multiplication Algorithms & FPGA Implementation

Analysis of Different Multiplication Algorithms & FPGA Implementation IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 4, Issue 2, Ver. I (Mar-Apr. 2014), PP 29-35 e-issn: 2319 4200, p-issn No. : 2319 4197 Analysis of Different Multiplication Algorithms & FPGA

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

High Ppeed Circuit Techniques for Network Intrusion Detection Systems (NIDS)

High Ppeed Circuit Techniques for Network Intrusion Detection Systems (NIDS) The University of Akron IdeaExchange@UAkron Mechanical Engineering Faculty Research Mechanical Engineering Department 2008 High Ppeed Circuit Techniques for Network Intrusion Detection Systems (NIDS) Ajay

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Operating Systems: Internals and Design Principles You re gonna need a bigger boat. Steven

More information

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems Salford Systems Predictive Modeler Unsupervised Learning Salford Systems http://www.salford-systems.com Unsupervised Learning In mainstream statistics this is typically known as cluster analysis The term

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

CNC Milling Machines Advanced Cutting Strategies for Forging Die Manufacturing

CNC Milling Machines Advanced Cutting Strategies for Forging Die Manufacturing CNC Milling Machines Advanced Cutting Strategies for Forging Die Manufacturing Bansuwada Prashanth Reddy (AMS ) Department of Mechanical Engineering, Malla Reddy Engineering College-Autonomous, Maisammaguda,

More information

Improving Achievable ILP through Value Prediction and Program Profiling

Improving Achievable ILP through Value Prediction and Program Profiling Improving Achievable ILP through Value Prediction and Program Profiling Freddy Gabbay Department of Electrical Engineering Technion - Israel Institute of Technology, Haifa 32000, Israel. fredg@psl.technion.ac.il

More information

Design of CPU Simulation Software for ARMv7 Instruction Set Architecture

Design of CPU Simulation Software for ARMv7 Instruction Set Architecture Design of CPU Simulation Software for ARMv7 Instruction Set Architecture Author: Dillon Tellier Advisor: Dr. Christopher Lupo Date: June 2014 1 INTRODUCTION Simulations have long been a part of the engineering

More information

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015 Advanced Parallel Architecture Lesson 3 Annalisa Massini - Von Neumann Architecture 2 Two lessons Summary of the traditional computer architecture Von Neumann architecture http://williamstallings.com/coa/coa7e.html

More information

Exploiting Value Prediction for Fault Tolerance

Exploiting Value Prediction for Fault Tolerance Appears in Proceedings of the 3rd Workshop on Dependable Architectures, Lake Como, Italy. Nov. 28. Exploiting Value Prediction for Fault Tolerance Xuanhua Li and Donald Yeung Department of Electrical and

More information

The Need for Speed: Understanding design factors that make multicore parallel simulations efficient

The Need for Speed: Understanding design factors that make multicore parallel simulations efficient The Need for Speed: Understanding design factors that make multicore parallel simulations efficient Shobana Sudhakar Design & Verification Technology Mentor Graphics Wilsonville, OR shobana_sudhakar@mentor.com

More information

METAL OXIDE VARISTORS

METAL OXIDE VARISTORS POWERCET CORPORATION METAL OXIDE VARISTORS PROTECTIVE LEVELS, CURRENT AND ENERGY RATINGS OF PARALLEL VARISTORS PREPARED FOR EFI ELECTRONICS CORPORATION SALT LAKE CITY, UTAH METAL OXIDE VARISTORS PROTECTIVE

More information

Duke University Department of Electrical and Computer Engineering

Duke University Department of Electrical and Computer Engineering Duke University Department of Electrical and Computer Engineering Senior Honors Thesis Spring 2008 Proving the Completeness of Error Detection Mechanisms in Simple Core Chip Multiprocessors Michael Edward

More information