RECONFIGURABLE FAULT TOLERANCE FOR SPACE SYSTEMS


RECONFIGURABLE FAULT TOLERANCE FOR SPACE SYSTEMS

By

ADAM M. JACOBS

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013

© 2013 Adam M. Jacobs

To my parents for all of their patience and support

ACKNOWLEDGMENTS

This work was supported in part by the I/UCRC Program of the National Science Foundation under Grant Nos. EEC and IIP. The author gratefully acknowledges equipment and tools provided by various vendors that helped make this work possible. The author also thanks fellow graduate student Grzegorz Cieslewski for developing the FPGA fault-injection tool used to gather many of the experimental results for this work.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 BACKGROUND AND RELATED RESEARCH
   FPGA Performance and Power Efficiency
   Partial Reconfiguration
   Single-Event Effects
   FPGAs in Space Systems
   Low-Overhead Fault Tolerance Methods
   Algorithm-Based Fault Tolerance
   Task Scheduling

3 FRAMEWORK FOR RECONFIGURABLE FAULT TOLERANCE
   RFT Hardware Architecture
      RFT Controller Operation
      MicroBlaze Operation
      Environment-Based Fault Mitigation
      RFT Controller Resource and Performance Overheads
   RFT Fault-Rate Model
   RFT Performability Model
   Results and Analysis
      Validation Case Study
      Orbital Case Studies
         Low-Earth orbit case study
         Highly-elliptical orbit case study
   Conclusions

4 ALGORITHM-BASED FAULT TOLERANCE FOR FPGA SYSTEMS
   Matrix Multiplication
      Checksum-Based ABFT for Matrix Multiplication
      Matrix-Multiplication Architectures
         Baseline, serial architecture
         Fine-grained parallel architecture
         Coarse-grained parallel architecture
         Architectural modifications for ABFT
      Resource-Overhead Experiments
         Resource overhead of serial architectures
         Resource overhead of parallel architectures
      Fault-Injection Experiments
         Design vulnerability of serial architectures
         Design vulnerability of parallel architectures
      Analysis of Matrix-Multiplication Architectures
   Fast Fourier Transform
      Checksum-Based ABFT for FFTs
      FFT Architectures
         Radix-2 Burst-IO FFT architecture
         Radix-2 Pipelined FFT architecture
         Architectural modifications for ABFT
      Resource-Overhead Experiments
         Resource overhead of Burst-IO architecture
         Resource overhead of Pipelined architecture
      FFT Fault Injection
      Analysis of FFT Architectures
   Conclusions

5 RFT SYSTEM INTEGRATION FOR RAPID SYSTEM DEVELOPMENT
   Dynamically Generated RFT Components
      RFT Controller Point-to-Point Interface
      Parameterized and Configurable Voting Logic
   Task Scheduling for RC Systems in Dynamic Fault-Rate Environments
      Selection Criteria for Fault-Tolerant Mode
         FT-mode selection using thresholds
         Time-resource metric for FT-mode selection
      Scheduler for RFT
         RFT architecture description
         Software Simulation
      Analysis and Results
         Constant fault rates
         Dynamic fault-rate case studies
         Scheduling improvements
   Conclusions

6 CONCLUSIONS

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 RFT fault-tolerance modes
RFT controller resource usage
Fault-injection results for RFT components
RFT Markov model validation
Unavailability and performability for LEO case study
Unavailability and performability for HEO case study
Resource utilization and overhead of serial MM designs
Serial matrix multiplication fault-injection results
Resource utilization and overhead of FFT designs
FFT fault-injection results
RFT controller resource usage
Dynamic scheduling results

LIST OF FIGURES

3-1 System-on-chip architecture with RFT controller
RFT controller PLB-to-PRR interface
RFT fault-rate model
Phased-mission Markov model transitioning between TMR and DWC modes
Markov models of RFT modes
RFT validation Markov models
LEO fault rates using the RFT fault-rate model
LEO system availability
Effects of adaptive thresholds on availability and performability
HEO fault rates using the RFT fault-rate model
HEO system availability
Matrix-multiplication architectures
Matrix-multiplication pseudocode
Matrix-multiplication ABFT-Extra architecture
Slice overhead of fine-grained parallel matrix multiplication
Slice overhead of coarse-grained parallel matrix multiplication
DSP48 and BlockRAM overhead of parallel matrix multiplication
Fault vulnerability of fine-grained parallel matrix multiplication
Fault vulnerability of coarse-grained parallel matrix multiplication
FFT architectures
Required ABFT threshold value for floating-point FFTs
Research areas for Phase
Comparison of RFT architectures
Flowchart of scheduling simulator
Effect of arrival rate on fault-free operation
5-5 Effect of fault rate on task rejection
Fault-rate profile for case studies
Resource fragmentation from adaptive placement

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

RECONFIGURABLE FAULT TOLERANCE FOR SPACE SYSTEMS

By Adam M. Jacobs

May 2013

Chair: Alan D. George
Major: Electrical and Computer Engineering

Commercial SRAM-based field-programmable gate arrays (FPGAs) can provide space applications with the performance, energy efficiency, and adaptability needed to meet next-generation mission requirements. However, mitigating an FPGA's susceptibility to radiation-induced faults is challenging. Triple-modular redundancy (TMR) techniques are traditionally used to mitigate radiation effects, but TMR incurs substantial overheads, such as increased area and power requirements. To reduce overhead while providing sufficient radiation mitigation, this research proposes a framework for reconfigurable fault tolerance (RFT) that enables system designers to dynamically adjust a system's level of redundancy and fault mitigation based on the varying radiation incurred at different orbital positions. To realize this goal and validate the effectiveness of the approach, three areas are investigated and addressed. First, a method for accurately estimating time-varying fault rates in space systems and a reliability and performance model for adaptive systems are needed to quantify the effectiveness of the RFT approach. Using multiple case-study orbits, our models predict that adaptive fault-tolerance strategies can reduce unavailability by 85% relative to low-overhead fault-tolerance techniques and improve performability by 128% over traditional, static TMR fault tolerance.

Second, low-overhead fault-tolerance techniques that can be used within the RFT framework for improved performance must be investigated. The effectiveness of algorithm-based fault tolerance (ABFT) for FPGA-based systems is explored for matrix multiplication and FFT. ABFT kernels were developed for an FPGA platform, and reliability was measured using fault-injection testing. We show that matrix multiplication and FFTs with ABFT can provide improved reliability (vulnerability reduced by 98%) with low resource overhead, and that both scale favorably with additional parallelism. Third, methods for facilitating the integration of RFT hardware into existing PR-based systems and architectures are explored. We expand the RFT framework to be used with bus-based or point-to-point architectures, and we design a fault-tolerant task-scheduling algorithm that schedules RFT tasks in a dynamically changing fault environment to maximize system performability. Combined, these three areas demonstrate the capability of RFT to provide both performance and reliability in space. Using low-overhead fault-tolerance techniques and reconfiguration, RFT can meet the strict constraints of next-generation space systems.

CHAPTER 1
INTRODUCTION

As remote sensor technology for space systems increases in fidelity, the amount of data collected by orbiting satellites and other space vehicles will continue to outpace the ability to transmit that data to other stations (e.g., ground stations, other satellites). Increasing future systems' onboard data-processing capabilities can alleviate this downlink bottleneck, which is caused by bandwidth limitations and high-latency transmission. Onboard data processing enables much of the raw data to be interpreted, reduced, and/or compressed on the space system before the results are transmitted to ground stations or other space systems, thus reducing data-transmission requirements. For applications with low data-transmission requirements, improved onboard data processing can enable more complex and autonomous capabilities. These autonomous capabilities will enable new classes of space missions, such as in situ scientific studies, constellations of multiple coordinated space systems, or automated maneuvering and guidance of deep-space systems. Finally, increased onboard data processing can enable future space systems to meet increasingly stringent real-time constraints. However, increasing onboard data-processing capabilities requires high-performance computing, which has largely been absent from space systems. In addition to the high performance requirements of these enhanced future systems, space environments impose several other stringent, high-priority design constraints. System design decisions must consider the system's size, weight, and power (SWaP) and heat dissipation, which are dictated by the system's physical platform configuration. The satellite's physical dimensions restrict the system's size, the photovoltaic solar array's capacity restricts power generation, and all heat dissipation must occur through passive, radiative cooling, which can be slow. In addition to SWaP requirements, system designers must consider system component reliability and system availability

because faulty space systems are either impossible or prohibitively expensive to service. Radiation-hardened devices, with increased protection from long-term radiation exposure (total ionizing dose), provide system reliability and correctness. While device-level radiation hardening increases component lifetimes and reliability, these hardened devices are expensive and have dramatically reduced performance compared to non-hardened commercial-off-the-shelf (COTS) components. To design an optimal system, the performance, SWaP, and reliability requirements of future space applications must be considered together. System design philosophy must consider the worst-case operating scenario (typically worst-case radiation), which may dramatically limit overall system performance even though the worst-case scenario may be infrequent (radiation levels typically change with orbital position). However, if components used for redundancy and reliability can be dynamically repurposed to perform useful, additional computation, future space systems could meet both high-performance and high-reliability requirements. Future systems could maintain high reliability during worst-case operating environments while achieving high performance during less radiation-intensive periods. Current design methodologies do not account for this type of adaptability. Therefore, for future space systems to achieve high levels of performance, more sophisticated and adaptive system design methodologies are necessary. One approach for adaptive high-performance space-system design leverages hardware-adaptive devices such as field-programmable gate arrays (FPGAs), which provide parallel computation at a high level of performance per unit size, mass, and power [Williams et al. 2010]. Fortunately, many space applications, such as synthetic aperture radar (SAR) [Le et al. 2004], hyperspectral imaging (HSI) [Hsueh and Chang 2008], image compression [Gupta et al. 2006], and other image-processing applications [Dawood et al. 2002], where onboard data processing can significantly reduce data-transmission requirements, are amenable to an FPGA's highly parallel architecture.

Since reconfiguration enables FPGAs to perform a wide variety of application-specific tasks, these systems can rival application-specific integrated circuit (ASIC) performance while maintaining a general-purpose processor's flexibility. An SRAM-based FPGA can be reconfigured multiple times within a single application, allowing one FPGA to serve multiple functions by time-multiplexing its hardware resources and reducing the number of concurrently active processing modules when an application does not require all of them at once. Thus, FPGA reconfiguration facilitates small, lightweight, yet powerful systems that can be optimized for a space application's time-varying hardware requirements. To leverage FPGAs in space systems, the FPGA must operate correctly and reliably in high-radiation environments, such as those found in near-earth orbits. Currently, most radiation-hardened FPGAs have antifuse-based configuration memories that are immune to single-event upsets (SEUs). However, these hardened FPGAs have reconfiguration limitations and small capacities, reducing the primary performance benefits offered by COTS SRAM-based FPGAs. Fortunately, when combined with special system design techniques, SRAM-based FPGAs can be viable for space systems. An SRAM-based FPGA's primary computational limitation is the possibility of SEUs causing errors within the FPGA user logic and routing resources, which can manifest as configuration-memory upsets or logic-memory (e.g., flip-flop, user RAM) upsets, resulting in deviations from the expected application behavior. Fault-tolerance techniques, such as triple-modular redundancy (TMR) and memory scrubbing, can protect the system from most SEUs and significantly decrease SEU-induced errors, but designing an FPGA-based space system using TMR introduces at least 200% area overhead for each protected module. Depending on the expected upset rates for a given space system, other lower-overhead fault-tolerance methods could provide sufficient reliability while maximizing the resources available for performance.
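To make this tradeoff concrete, the following toy model (illustrative only, with invented parameter values; it is not the performability model developed later in this work) compares the useful module count and effective throughput of replication schemes that correct errors versus those that only detect them:

```python
def effective_throughput(n_modules, replication, p_err, detects, corrects):
    """Toy model of replication overhead: replication divides the usable
    module count, and detect-only schemes (e.g., DWC) lose additional
    throughput to re-computation of erroneous results."""
    useful = n_modules // replication
    if corrects:              # e.g., TMR: single errors masked, no redo
        return float(useful)
    if detects:               # e.g., DWC: detected errors force re-compute
        return useful * (1.0 - p_err)
    return float(useful)      # unprotected: errors pass through silently

# With 12 identical module slots and a low per-task error probability,
# DWC yields more useful work than TMR despite occasional re-computation:
tmr = effective_throughput(12, 3, 0.01, True, True)    # -> 4.0
dwc = effective_throughput(12, 2, 0.01, True, False)   # -> ~5.94
```

At low upset rates, the re-computation penalty of a detect-only scheme is far smaller than the extra replica cost of TMR, which is exactly the regime the RFT framework seeks to exploit.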

When designing a traditional space system, system designers estimate the expected worst-case upset rates and include an additional safety margin. However, since single-event upset (SEU) rates vary with orbital position and the majority of orbital positions experience relatively low upset rates, a system designed for the worst-case upset-rate scenario contains processing resources that are wasted during the frequent low-upset-rate periods. To provide the necessary reliability during high-upset-rate periods and reduce the processing overhead incurred during low-upset-rate periods, the fault-tolerance method must change based on the current upset rate. For example, during high-upset-rate periods, the system can be reconfigured to provide high reliability at the expense of reduced processing capability, while during low-upset-rate periods the system can be reconfigured for higher performance by re-provisioning the excess hardware (used for high reliability during high-upset-rate periods) to application functionality. This upset-rate-based adaptability provides high performance while maintaining reliability. This research proposes a framework for reconfigurable fault tolerance (RFT) that enables FPGA-based space systems to dynamically adapt the amount of fault tolerance based on the current upset rate. To realize this goal and validate the effectiveness of the approach, three areas must be addressed and investigated. First, a method for accurately estimating time-varying fault rates in space systems and a reliability and performance model for adaptive systems are needed to quantify the effectiveness of the RFT approach. Second, techniques for low-overhead fault tolerance that can be used within the RFT framework for improved performance must be investigated. Third, tools that facilitate the creation of the RFT hardware components are required to enable RFT integration into existing PR-based systems and architectures. The research presented in this document is divided into three phases, each proposing a solution to the goals listed above.
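The upset-rate-driven mode switching described above can be sketched as a simple threshold policy. The thresholds and mode list below are hypothetical placeholders; the actual selection criteria are developed later in this work:

```python
# Hypothetical thresholds (upsets per device-day) and mode names; the
# real selection criteria are developed in the scheduling phase of this work.
FT_MODES = [           # (mode, minimum upset rate at which it is chosen)
    ("TMR", 1e-1),     # high upset rate: full triplication
    ("DWC", 1e-3),     # moderate upset rate: duplication with compare
    ("ABFT", 0.0),     # low upset rate: algorithm-based checks only
]

def select_ft_mode(upset_rate):
    """Return the lowest-overhead fault-tolerance mode whose threshold
    still covers the current upset rate."""
    for mode, threshold in FT_MODES:
        if upset_rate >= threshold:
            return mode
    return FT_MODES[-1][0]

assert select_ft_mode(0.5) == "TMR"     # worst-case environment
assert select_ft_mode(1e-4) == "ABFT"   # quiet portion of the orbit
```

Because the mode list is ordered from most to least protective, the first matching threshold always yields the most conservative mode that the current environment demands.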

The first phase of this research addresses the need for performance and reliability modeling of adaptive systems in changing environments, such as FPGA-based space systems. A fault-rate estimation methodology for systems in near-earth orbits is developed to estimate the time-varying fault rates experienced during a specified orbit. A phased-mission Markov model is then used to estimate the performance and reliability of an adaptive system under several adaptation schedules. In this phase, we also develop and implement an RFT controller design that can provide adaptive fault tolerance in an FPGA. The reliability and performance models are then experimentally validated using fault-injection testing on the RFT controller. The second phase of this research investigates the effectiveness of techniques for low-overhead fault tolerance in FPGA systems. Algorithm-based fault tolerance (ABFT) is a technique that can be used with many linear-algebra operations, such as matrix multiplication or LU decomposition [Huang and Abraham 1984], to provide fault tolerance with as little as 5-10% overhead. Traditionally, ABFT has been implemented in software, with multiprocessor arrays, and in hardware, with systolic arrays, to protect application datapaths. Our ABFT approach may be used in FPGA applications to provide both datapath and configuration-memory protection with low overhead. Other fault-tolerance techniques (e.g., duplication with compare, error-correcting codes, concurrent error detection, reduced-precision redundancy) are also examined and evaluated for FPGA resource overhead and reliability within the context of the RFT framework. In the third phase of this research, methods for integrating the RFT framework with pre-existing partial reconfiguration architectures will be examined, enabling fault tolerance for pre-existing systems with minimal design changes. Three aspects of system integration will be considered. First, a tool enabling the dynamic creation of system-specific RFT controllers, allowing mission-specific, resource-optimized hardware. Second, support and integration of an RFT controller with existing

PR architectures to enable fault tolerance with minimal design modifications. Third, investigation of fault-tolerant task scheduling in the presence of changing fault rates. The remaining sections of this document are organized as follows. Chapter 2 surveys previous work related to topics common to all three phases of this research. Chapter 3 describes the RFT hardware architecture, the RFT fault-rate model, and the RFT performability model, and provides case-study examples for two near-earth orbits. Chapter 4 evaluates the use of low-overhead fault-tolerance techniques in FPGA systems, analyzes reliability results obtained using fault injection in MM and FFT case studies, and suggests design modifications for higher reliability. Chapter 5 demonstrates a dynamic RFT hardware-creation methodology and a point-to-point RFT controller implementation, and explores heuristics for an environmentally aware task scheduler for RFT systems. Finally, Chapter 6 presents conclusions and outlines directions for possible future research.
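As a concrete preview of the checksum-based ABFT scheme evaluated in Chapter 4, the full-checksum property of matrix multiplication [Huang and Abraham 1984] can be sketched in software (a minimal sketch only; the dissertation implements the technique in FPGA hardware):

```python
def matmul(A, B):
    """Plain matrix product of nested-list matrices."""
    n, m, p = len(A), len(B[0]), len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(p)) for j in range(m)]
            for i in range(n)]

def with_column_checksum(A):
    """Append a row holding each column's sum (column-checksum matrix)."""
    cols = len(A[0])
    return A + [[sum(row[j] for row in A) for j in range(cols)]]

def with_row_checksum(B):
    """Append a column holding each row's sum (row-checksum matrix)."""
    return [row + [sum(row)] for row in B]

# The product of a column-checksum A and a row-checksum B is a
# full-checksum matrix: its last row and column must equal the sums of
# the data rows and columns, so any mismatch flags a computation error.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(with_column_checksum(A), with_row_checksum(B))
data = [row[:-1] for row in C[:-1]]          # the actual product A*B
assert C[-1][:-1] == [sum(r[j] for r in data) for j in range(2)]
assert [row[-1] for row in C[:-1]] == [sum(r) for r in data]
```

Because the checksums are carried through the multiplication itself, verifying them costs only a handful of additions, which is the source of ABFT's low overhead relative to full replication.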

CHAPTER 2
BACKGROUND AND RELATED RESEARCH

In this chapter, we motivate the need for SRAM-based FPGAs in space systems. The FPGAs used in space systems must be rated for a sufficient total ionizing dose (TID) to ensure long-term device functionality over a given mission duration. Single-event effects (SEEs), caused by collisions with high-energy protons and heavy ions (i.e., radiation), are the primary short-term reliability concern in space systems. For FPGA-based systems to maintain reliability, a combination of radiation-hardened FPGAs and complex fault-tolerance techniques is required to mitigate errors. Since many common fault-tolerance techniques require substantial temporal or spatial overhead, new low-overhead techniques allow more reconfigurable logic to be used for actual, useful computation instead of for redundancy.

2.1 FPGA Performance and Power Efficiency

SRAM-based FPGAs offer a very large amount of configurable logic and the ability to modify portions of a design at run-time, giving them the capability to efficiently perform a wide range of applications by using a high degree of parallelism while running at low clock rates. Williams et al. [2010] developed computational density metrics to quantify and predict the performance of SRAM-based FPGAs for specific types of algorithms and applications. Their analysis showed that FPGAs were capable of providing between 3 and 60 times more performance per unit power than many conventional general-purpose processors, depending on the types of operations being considered. The performance and power efficiency of FPGAs is extremely desirable for the small power budgets of space systems.

2.2 Partial Reconfiguration

Partial reconfiguration (PR) enables a user to modify a portion of an FPGA's configuration while the remainder of the FPGA remains operational. PR time-multiplexes mutually exclusive application-specific processing modules on the same hardware, and

only the modules being reconfigured halt operation, which makes PR attractive for real-time systems. Currently, Xilinx supports PR for Virtex-4 and newer devices [Xilinx 2010a], while Altera has recently announced PR support for Stratix-V devices [Altera 2010]. During the system design phase, system designers define the FPGA's partially reconfigurable regions (PRRs) and partially reconfigurable modules (PRMs) and route signals to/from the PRRs through bus-macro PR primitives (Xilinx 9.2 PR tool flow) or partition pins (Xilinx 12.1 PR tool flow). Partial bitstreams, communicated through external configuration interfaces (e.g., SelectMAP, JTAG), are used to reconfigure the PRRs with the PRMs. On Xilinx devices, the Internal Configuration Access Port (ICAP) is an internal configuration interface that allows user logic to directly reconfigure PRRs, removing the need for additional external configuration support devices. Additionally, since partial bitstreams are typically much smaller than full bitstreams (which configure the entire FPGA), with size roughly proportional to the size of the corresponding PRR, PR also reduces bitstream storage requirements.

2.3 Single-Event Effects

SEEs occur when high-energy particles, such as protons, neutrons, or heavy ions, collide with silicon atoms on a device, depositing the particle's electric charge into the device's circuit. Protons and electrons are trapped within the Earth's Van Allen belts, while heavy ions are mainly produced by galactic cosmic rays and solar flares. When a high-energy particle collides with a silicon device, the energy of the collision can cause the logical values stored in sequential memory elements to be inverted [Karnik and Hazucha 2004]. Errors caused by these particles are often referred to as soft errors or SEUs, as there is no permanent circuitry damage and any affected memories can be corrected by re-writing the correct values. Single-event functional interrupts (SEFIs), another type of SEE, can cause a semi-permanent fault that requires a circuit to be

power-cycled to restore correct operation. Single-event latchups (SELs) are destructive SEEs that occur when a particle creates a parasitic forward-biased structure within the device substrate, allowing destructively high currents to flow through the substrate and potentially damaging the device. SELs can be avoided by using appropriate device-manufacturing processes, and devices produced using silicon-on-insulator (SOI) processes are largely immune. SEL immunity is an important property when selecting devices for many space systems.

2.4 FPGAs in Space Systems

FPGA configuration memories can be constructed using several different technologies (e.g., antifuse, flash, and SRAM), each with performance and reliability tradeoffs. Traditionally, antifuse-based FPGAs have been used in space systems to provide simple processing capabilities and glue logic to interconnect multiple peripherals or to combine or replace the functionality of multiple ASICs. Antifuse-based FPGAs are one-time programmable: the configuration process permanently creates the FPGA's physical routing interconnect structure. This configuration process provides an inherent level of fault tolerance against SEUs, since the antifuse-based routing cannot be reversed. Additionally, many commercially available antifuse-based FPGAs (e.g., Actel RTSX-SU [Actel 2010a]) include replicated flip-flop cells to prevent upsets in the sequential logic. Antifuse devices also generally have a high TID threshold and immunity to SELs. Unfortunately, compared to flash-based or SRAM-based FPGAs, antifuse-based FPGAs contain a relatively small amount of available logic gates, and the fixed-logic structure limits their performance potential. Flash-based FPGAs (e.g., Actel RT-ProASIC3 [Actel 2010b]) attempt to maintain an antifuse-based FPGA's reliability while increasing the amount of configurable logic (logic available for a system designer to implement application functionality) and allowing reconfiguration. Flash-based FPGA configuration memories are composed of radiation-tolerant flash cells that provide reliability for combinational logic; however,

system designers must insert sequential-logic replication to fully protect the FPGA from faults. Even though flash-based FPGAs can be fully reconfigured to support multiple applications, they do not support the PR capability available on some SRAM-based FPGAs, due to a lack of architectural and vendor support. Concerns over TID effects on flash-based logic (floating-gate transistors) have prevented widespread acceptance of flash-based FPGAs in space systems [Wang 2003]. SRAM-based FPGAs are the most radiation-susceptible type of FPGA, since the design functionality is stored in vulnerable SRAM cells (i.e., configuration memory), and configuration-memory upsets cause functional changes to the design's logic. Traditionally, this vulnerability has prevented SRAM-based FPGAs from being used in highly critical space applications; however, some space systems use space-qualified SRAM-based FPGAs for onboard processing. Space-qualified FPGAs are similar to COTS FPGAs but are produced using epitaxial wafers, use ceramic, hermetically sealed packaging, and have been tested to ensure that damaging SEL events will not compromise the system. These devices are rated for TID levels high enough to be used in space systems [Xilinx 2010c]. Even with these reliability techniques, such space systems must still use several fault-mitigation strategies, such as TMR and configuration scrubbing, to ensure that system upsets are detectable and recoverable. (We note that unless specified, all further FPGA references implicitly refer to SRAM-based FPGAs.) Traditional FPGA-based space-system designs leverage spatial TMR. TMR uses a reliable majority voter connected to three identical module replicas in order to detect and mask errors in any one module. In the context of TMR, a module refers to the functional unit being replicated, which can range from a single logic gate to an entire device.
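At the logic level, majority voting reduces to a simple bitwise function of the three replica outputs. A minimal software sketch of the masking behavior (illustrative only; the actual voter is implemented in FPGA logic):

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three replica outputs: any single faulty
    replica is outvoted by the two agreeing copies."""
    return (a & b) | (a & c) | (b & c)

good = 0b10110010
corrupted = good ^ 0b00000100      # one replica suffers a single bit flip
assert tmr_vote(good, good, corrupted) == good   # the error is masked
```

Note that the voter masks the error but does not repair the faulty replica, which is why TMR is typically paired with scrubbing, as discussed below.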
There are two primary TMR variations: external and internal. External TMR uses three independent FPGAs working in lockstep, where each FPGA implements a module replica and the outputs are connected to an external radiation-hardened voter

that compares the results. External TMR requires significant hardware overhead (each protected module is triplicated, and board-layout complexity increases significantly), but it is reliable. The RCC board produced by SEAKR Engineering [Troxel et al. 2008] uses external TMR to provide reliable computation using Xilinx Virtex-4 FPGAs. Alternatively, internal TMR creates three identical modules within a single FPGA, with the majority voter residing internally or externally [Carmichael et al. 1999]. Internal TMR can reduce the number of physical FPGAs required to implement a space application, but may increase the chance of a common-mode failure, in which multiple modules fail simultaneously from a single fault. For example, a SEFI may cause multiple internally replicated modules to fail, whereas externally replicated FPGAs would be immune. Several tools assist system designers in incorporating TMR into space-system designs [Pratt et al. 2006; Xilinx 2004]. In addition to TMR, configuration scrubbing prevents error accumulation in the FPGA configuration memory. While TMR masks individual errors, it does not correct the underlying fault and cannot correct errors that occur in multiple modules. Scrubbing uses an external device to read back the FPGA's configuration memory and compare it against a known-good copy. Alternatively, some FPGAs calculate an error-correcting code (ECC) during configuration readback for every configuration frame, which can be used to detect and correct configuration faults. If a mismatch is detected, the correct configuration can be rewritten using PR without halting overall FPGA operation [Xilinx 2010a]. Traditionally, scrubbing is performed by an external radiation-hardened microcontroller to ensure the reliability of the reconfiguration process. However, a self-scrubber may be implemented within the FPGA using the ICAP available in Xilinx Virtex-4 and newer devices. Xilinx has also developed a single-event mitigation (SEM) IP core that can perform configuration-memory error detection and correction for user designs [Xilinx 2013b].
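The readback-and-compare loop at the heart of scrubbing can be sketched abstractly as follows (the frame read/write callbacks are hypothetical stand-ins for a real configuration interface such as the ICAP):

```python
def scrub(read_frame, write_frame, golden_frames):
    """One scrub pass: read each configuration frame, compare it with the
    known-good copy, and rewrite any frame that has been upset.
    read_frame/write_frame are hypothetical callbacks standing in for a
    real configuration port (e.g., ICAP readback and partial write)."""
    repaired = []
    for addr, golden in enumerate(golden_frames):
        if read_frame(addr) != golden:
            write_frame(addr, golden)   # repair via partial reconfiguration
            repaired.append(addr)
    return repaired

# Simulated configuration memory with one upset frame (frame 2 should be 0xCC):
mem = {0: 0xAA, 1: 0xBB, 2: 0x55}
fixed = scrub(mem.get, mem.__setitem__, [0xAA, 0xBB, 0xCC])
assert fixed == [2] and mem[2] == 0xCC
```

Because each frame is rewritten independently, repairs proceed without halting the rest of the device, mirroring the PR-based correction described above.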

Despite these drawbacks, SRAM-based FPGAs have been used in many space systems, including earth-observing science satellites, communication satellites, and satellites and rovers for Venus and Mars missions. For instance, space-qualified Xilinx Virtex-1000 (XQVR1000) devices were used on the Mars Exploration Rovers for motor-control functions, and four XQR4000XLs were used for the lander pyrotechnics [Ratter 2004]. Configuration readback and scrubbing were used for detection and correction of SEUs, and the full system was cycled once per Martian evening to remove persistent errors. These systems, which contained very little logic by today's standards and lacked PR capability, were not used for data processing. The increased logic resources of recent FPGA families enable future systems to use FPGAs for image processing and other computation-intensive applications. The SpaceCube, created at NASA Goddard Space Flight Center, uses multiple commercial Xilinx Virtex-4 FPGAs along with a radiation-hardened processor to provide onboard processing capabilities for a variety of missions [Flatley 2010]. The SpaceCube provided computational power for the Relative Navigation Sensors (RNS) experiment during Hubble Servicing Mission 4 and is currently being used as an on-orbit test platform aboard the Naval Research Laboratory's MISSE-7 experiment on the International Space Station (ISS). Each FPGA in the system contains a self-scrubber module, protected with TMR, to correct errors and prevent error accumulation.

2.5 Low-Overhead Fault Tolerance Methods

While TMR is the most common fault-tolerance method for FPGAs, the high area overhead of replicating modules partially negates some FPGA benefits. Therefore, much research has focused on developing alternative, low-overhead fault-tolerance methods for low-upset environments, such as replication-based fault tolerance [Laprie et al. 1990], ECCs [Rao and Fujiwara 1989], and application-specific optimizations that provide low-cost reliability. Many of these alternative techniques can detect errors quickly, but may require additional processing or complete re-computation to correct

24 the errors. When the expected upset rates are low, the re-computation rates may be acceptably low, depending on application throughput requirements. Replication-based fault tolerance represents the most commonly used type of fault-mitigation strategy, due to conceptual simplicity and high fault coverage. TMR can detect single errors and can correct/mask single errors using a majority voter. Duplication with compare (DWC) is an alternative replication-based method that compares the outputs of duplicated modules. DWC reduces the resource overhead by one half as compared to TMR, but DWC cannot correct errors and must fully re-compute data when errors are detected [Johnson et al. 2008]. Shim et al. [2004] proposed reduced-precision redundancy (RPR) for numerical computation. RPR triplicates an application module as in TMR, but the replicas have lower precision or only operate on the most-significant bits of application data, ensuring that the most-significant bits are protected and errors in the least-significant bits are treated as noise in the system. RPR reduced the number of detectable faults (fault coverage), and resulted in significant resource savings while maintaining sufficient signal-to-noise ratios for many DSP applications. Morgan et al. [2007] investigated the use of temporal redundancy and quadded logic as additional methods for providing fault tolerance through redundancy. However, due to inefficient mapping to the underlying FPGA architecture, their methods did not provide fault tolerance improvement over TMR and imposed a large area overhead. The effects of the fault-tolerant methods were offset by the increased cross-sectional vulnerability of the larger FPGA designs. Even though these replication-based methods for fault tolerance are suitable for FPGAs, several new fault-tolerant FPGA architectures leverage an FPGA s high capacity and flexibility while maintaining reliability. Alnajiar et al. 
[2009] proposed a hypothetical coarse-grained multi-context FPGA architecture that supported TMR, DWC, and single-context modes. Their work explored the effects of soft errors and aging on the proposed architecture, with an example Viterbi decoder module mapped to the 24

25 hypothetical architecture. Kyriakoulakos et al. [2009] proposed simple modifications to the current Virtex-5 and Virtex-6 architectures to allow native support for DWC and TMR. By adding an XOR-gate between each 5-input LUT within a larger 6-input LUT, the existing architecture and synthesis tools required only minimal changes to support their approach, while incurring only 17.5% to 76% slice utilization overhead for DWC or TMR, respectively. 2.6 Algorithm-Based Fault Tolerance Algorithm-based fault tolerance (ABFT) is a method that can be used with many linear-algebra operations to provide fault tolerance without the use of explicit replication., The ABFT method was originally described for matrix multiplication and LU decomposition [Huang and Abraham 1984] but has been expanded to protect other algorithms comprised of linear operations, such as QR decomposition, Fast Fourier Transform [Tao and Hartmann 1993; Wang and Jha 1994], and finite element analysis [Mishra and Banerjee 2003; Roy-Chowdhury et al. 1996]. The traditional description of ABFT was designed for systolic arrays, but the method has also been used in multiprocessor-based, high-performance computing [Yao et al. 2012]. ABFT augments an original data matrix with row and/or column checksums and the linear-algebra operation is performed on the new, augmented matrix. If the linear-algebra operation is computed successfully, the resulting augmented matrix will contain valid, consistent checksums [Huang and Abraham 1984]. ABFT checksum generation and comparison has lower computational complexity than the primary linear-algebra operation. ABFT computational overhead is generally low, and as a proportion of total computation, decreases as the matrix size increases. The mathematical basis of ABFT for the matrix multiplication and FFT algorithms will be shown in Section 3 and Section 4, respectively. 
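As a toy-sized sketch of the checksum scheme (illustrative Python with helper names of our own choosing, not the hardware architectures evaluated later): augmenting A with a column-checksum row and B with a row-checksum column yields a product whose final row and column check the true result.

```python
def column_checksum(A):
    """Append a checksum row holding each column's sum (A becomes (m+1) x n)."""
    return A + [[sum(col) for col in zip(*A)]]

def row_checksum(B):
    """Append to each row its sum as a checksum column (B becomes n x (p+1))."""
    return [row + [sum(row)] for row in B]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def abft_check(C_aug):
    """Return (checksums consistent?, the unaugmented result matrix)."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in C_aug)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*C_aug))
    return rows_ok and cols_ok, [row[:-1] for row in C_aug[:-1]]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_aug = matmul(column_checksum(A), row_checksum(B))   # full-checksum product
ok, C = abft_check(C_aug)                             # ok: no errors detected
C_aug[0][0] += 1                                      # inject a single data error
assert ok and not abft_check(C_aug)[0]
```

With integer data an exact equality test suffices; for floating-point data the comparison must instead use a threshold, as discussed next.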
When using ABFT with floating-point algorithms, the introduction of rounding and precision errors requires the use of a threshold comparison to differentiate rounding errors from incorrectly computed data. The determination of a sufficient threshold is significantly influenced by properties of the input data, and can affect error coverage and the number of false-positive results [Chowdhury and Banerjee 1996; Roy-Chowdhury and Banerjee 1993]. Additionally, the original description of ABFT considered algorithms executed on systolic arrays, with each processing element computing a single data element of the result matrix. In these systems, a single fault could not propagate to other processing elements and would only result in a single erroneous result element. Silva et al. [1998] investigated these vulnerabilities in traditional ABFT implementations and proposed methods for improving fault coverage using a Robust ABFT approach.

2.7 Task Scheduling

Task-scheduling algorithms can be categorized as either online or offline. Offline scheduling algorithms have complete knowledge of all tasks that must be scheduled. The general scheduling optimization problem is NP-hard, but efficient heuristics exist, and offline schedules can be pre-determined at compile time. With online scheduling, tasks arrive at the scheduler periodically over time and must be placed around previously scheduled tasks. Task arrival rates and patterns can greatly affect the quality of the scheduler's results. Arndt et al. [2000] examined many online scheduling algorithms for distributed parallel computers and used simulation to evaluate their performance. Of the several algorithms studied, the FirstFit algorithm, which prioritizes scheduling by the arrival time of each task, provided good schedule lengths while minimizing the average wait times.

Real-time systems introduce additional constraints for task scheduling. Each task must be completed by a deadline; otherwise, the results will no longer be relevant or needed. A hard deadline must be met, otherwise the system is considered failed. A firm deadline can be missed, but the usefulness of the result after the deadline is zero. A soft deadline can be missed, but the value of the result decreases after the deadline has passed. In a hard real-time system all deadlines must be met, but the goal of a soft real-time system is to meet as many deadlines as possible while optimizing for other criteria. Traditionally, schedulers attempt to minimize criteria such as makespan (total schedule length) or average task latency. Han et al. [2003] created a fault-tolerant scheduling algorithm for periodic real-time software tasks. For each primary task, an alternate, less-precise task is also used to generate a sufficient result before the deadline. These alternate tasks are scheduled as close to the task deadline as possible. In the case of a primary-task failure, the alternate task will be executed; if the primary task succeeds, the alternate task is discarded. Their algorithm is intended for offline use and was designed to protect systems against software faults. Pathan [2006] extends the rate-monotonic scheduling algorithm to support temporal TMR (RM-FT), scheduling multiple copies of tasks to mask faults in any one copy. RM-FT also requires periodic tasks to perform a scheduling analysis. Scheduling aperiodic real-time tasks is a more difficult problem and remains an area of ongoing study.

Scheduling tasks for reconfigurable computing creates an additional level of complexity. Instead of scheduling a task for a one-dimensional array of processors, scheduling on a two-dimensional FPGA fabric becomes a constrained placement problem. Additionally, tasks may have multiple hardware or software implementations, increasing the overall search space. Banerjee et al. [2005] present an offline KLFM heuristic [Kernighan and Lin 1970] which incorporates detailed placement information in order to provide high-quality schedules. Mei et al. [2000] combine a genetic algorithm to determine HW/SW placement with a traditional list-scheduling algorithm to enable online scheduling of real-time reconfigurable embedded systems. Steiger et al. [2004] developed two heuristic scheduling algorithms, Horizon and Stuffing, which provide good results while limiting the computational requirements.
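To make the online-scheduling discussion concrete, a minimal arrival-ordered greedy scheduler can be sketched as follows (our own simplification for illustration; this is not the exact FirstFit algorithm from the cited work):

```python
def firstfit_schedule(tasks, num_procs):
    """Greedy online schedule: tasks are handled in arrival order, and each
    runs on the processor that can start it soonest."""
    free_at = [0] * num_procs                  # time each processor becomes idle
    schedule = []
    for name, arrival, duration in sorted(tasks, key=lambda t: t[1]):
        p = min(range(num_procs), key=lambda i: free_at[i])
        start = max(arrival, free_at[p])       # wait for arrival and a free slot
        free_at[p] = start + duration
        schedule.append((name, p, start, start + duration))
    return schedule

# (name, arrival time, duration) tuples; values are made up
tasks = [("A", 0, 4), ("B", 1, 2), ("C", 2, 3)]
print(firstfit_schedule(tasks, num_procs=2))
```

Even this toy version shows the key property discussed above: schedule quality depends on the arrival pattern, since each placement is made without knowledge of future tasks.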

CHAPTER 3
FRAMEWORK FOR RECONFIGURABLE FAULT TOLERANCE

RFT leverages COTS components in space systems to achieve high performance and flexibility while maintaining reliability. Beyond traditional, spatial TMR, alternative fault-mitigation methods may be appropriate given a particular application's performance requirements and set of expected environmental factors (e.g., radiation). Other methods, such as temporal TMR, ABFT, software-implemented fault tolerance (SIFT), or checkpointing and rollback, may be suitable for system-level protection. Each alternative fault-mitigation method has tradeoffs between performance, reliability, and overhead, and Pareto-optimal operation may change over a system's lifetime. Therefore, the main goal of our proposed RFT framework is to enable a system to autonomously adapt to Pareto-optimal operation based on the system's current environmental situation.

Our RFT framework consists of three main elements: a PR-based hardware architecture; a fault-rate model for estimating orbital SEU rates; and a methodology for modeling RFT system performability. The hardware architecture, described in Section 3.1, is similar to other traditional SoC architectures used for reconfigurable computing, with multiple, identical PRRs, which are leveraged for module redundancy. The hardware architecture allows the system to execute each processing module (PRM) in several possible fault-tolerance modes, each with differing performance and reliability characteristics. RFT software adapts the amount of per-module redundancy in accordance with the current environment and temporarily pauses only the reconfigured hardware modules to preserve and record application state while changing the fault-tolerance mode, allowing the remainder of the system to continue processing. Additionally, modules with constant state can continue processing during the adaptation process.
Section 3.2 presents a model for estimating expected fault rates in potential space-system orbits. Finally, Section 3.3 presents a performability model that quantifies the RFT benefits in environments with varying upset rates.

3.1 RFT Hardware Architecture

Figure 3-1 shows the high-level architecture of an FPGA-based SoC design with PRRs integrated with an RFT controller. The main architectural components include a microprocessor, a memory controller, I/O ports, PRRs (1...N) for PRMs, and the system interconnect, which connects all of these components to the microprocessor. Since we leverage Xilinx FPGAs, the microprocessor is a MicroBlaze (less resource-intensive processors such as PicoBlaze can also be used) and the system interconnect is a processor local bus (PLB). The MicroBlaze orchestrates PRR reconfiguration using the ICAP, maintains the state of the currently active PRMs, and initiates fault-tolerance mode switching. All architectural components except for the PRRs are protected using TMR, since these components' functionality is crucial to the entire system's reliability. Although the ICAP cannot be replicated, the signals to and from the ICAP are also protected with TMR. Tools such as Xilinx's TMRTool [Xilinx 2004] or BYU's EDIF-based TMR tool [Pratt et al. 2006] automate TMR design creation by applying low-level TMR voting on the design's original, unprotected netlist. For additional SEU protection, FPGA configuration scrubbing should be performed in order to prevent error accumulation. Scrubbing can be performed with an external scrubber and radiation-hardened configuration storage, or with an internal scrubber using the internal configuration ECC present in Virtex-4 and later devices.

Each PRM uses a PLB-compatible interface that connects to the RFT controller or directly to the PLB, based on whether or not the PRM is an RFT-enabled module (a module replicated/instantiated by the RFT controller). The RFT controller instantiates the bus macros or other low-level components required for interfacing with the PRRs. The RFT controller also contains multiple majority voters and comparators (voting logic) that can be used to detect or correct errors by evaluating the replicated PRMs' outputs. The RFT-enabled modules are used in parallel to create redundancy-based fault-protection modes (e.g., DWC, TMR) by interfacing with the RFT controller's voting logic. Additionally, other single-module fault-protection modes, such as ABFT, can be used for individual PRRs, and the RFT controller provides additional fault-tolerance components, such as watchdog timers, to detect hanging conditions within PRMs. Table 3-1 lists the currently supported fault-tolerance modes of RFT and the modes' fault-tolerance type and PRR requirements. The MicroBlaze evaluates the system's current performance requirements and monitors external stimuli (radiation) using external sensors to determine when the fault-tolerance mode should be switched, at which time the MicroBlaze reconfigures the appropriate PRRs and the RFT controller's internal voting scheme between the PRRs' outputs for the new fault-tolerance mode.

RFT Controller Operation

Figure 3-2 illustrates the interface between PRRs and the PLB for an RFT controller that can operate in single-module and redundancy-based fault-tolerance modes. This RFT controller interface routes input signals from the system interconnect to the appropriate PRRs and routes voting output signals back to the PLB. The abstraction of this logic into the RFT controller enables pre-existing PRMs to leverage the fault-tolerance modes and interface with the RFT controller with minimal modifications. To communicate data and control signals between the MicroBlaze, the PRRs, and the RFT controller, the system designer assigns, at design time, a large memory-mapped region of the MicroBlaze's address space to the RFT controller. In order to route signals to specified PRRs, the system designer also subdivides this region into smaller subregions and assigns a subregion to each PRR, while taking into consideration the memory-interface requirements of each potential PRM.
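The address-to-subregion mapping can be sketched in a few lines. The base address, subregion size, and function name below are invented for illustration and are not taken from the actual design:

```python
RFT_BASE = 0xC000_0000   # hypothetical base of the RFT controller's region
PRR_SIZE = 0x0001_0000   # hypothetical 64 KiB subregion per PRR
NUM_PRRS = 6

def decode_prr(addr):
    """Map a PLB address to (prr_index, offset) within that PRR's subregion."""
    offset = addr - RFT_BASE
    if not 0 <= offset < NUM_PRRS * PRR_SIZE:
        raise ValueError("address outside the RFT controller's region")
    return offset // PRR_SIZE, offset % PRR_SIZE

print(decode_prr(0xC002_0010))   # third subregion, offset 0x10
```

In a redundancy-based fault-tolerance mode, the hardware decoder would additionally fan the decoded write out to the replica PRRs, as described below.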
Each PRM must implement the actual memory interface (e.g., dual-port RAM, FIFO-based interface), which allows pre-existing PRMs to interface with the RFT controller with minor modifications, since no specific interface is required. When communication data from the MicroBlaze arrives over the PLB, the RFT controller's address decoder processes the input address and determines the destination PRR(s) based on the current fault-tolerance mode. While operating in single-module fault-tolerance modes, the data is passed directly to the PRR specified by the decoded address. While operating in redundancy-based fault-tolerance modes, the data is passed to multiple PRRs using appropriate enable signals to the PRRs' interfaces. The RFT controller routes the outputs from each individual PRR, the outputs from the voting logic via the FT-Mode register, or recorded internal status information over the PLB to the MicroBlaze. When the MicroBlaze requests a PRR's output, the RFT controller's output multiplexer (Output Mux in Figure 3-2) selects the appropriate PRR output to route to the PLB controller, based on the decoded address from the MicroBlaze and the current FT-Mode value. In single-module fault-tolerance modes, the RFT controller's output multiplexer routes signals directly from the PRRs to the PLB controller, bypassing the RFT controller's internal voting logic. In redundancy-based fault-tolerance modes, the output multiplexer routes the verified outputs from the RFT controller's voting logic to the PLB controller.

In addition to providing routing and voting logic for redundancy-based fault-tolerance modes, the RFT controller supports additional fault-tolerance capabilities that do not require redundant PRMs. If the system is operating in a single-module fault-tolerance mode, each PRM may use the RFT controller's watchdog timers or perform internal fault detection. The RFT controller provides each PRR with an optional watchdog-timer interface using a signal that must be asserted periodically within a user-defined time interval (usually on the order of seconds).
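The watchdog behavior can be modeled behaviorally in a few lines (an illustrative sketch with invented names, not the RTL):

```python
class WatchdogTimer:
    """Per-PRR watchdog: the PRM must reset the timer within `interval`
    ticks; otherwise an interrupt flags a possibly hung module."""
    def __init__(self, interval):
        self.interval = interval
        self.count = 0
        self.interrupt = False

    def kick(self):                  # PRM asserts the watchdog reset signal
        self.count = 0

    def tick(self):                  # one timer clock period elapses
        self.count += 1
        if self.count >= self.interval:
            self.interrupt = True    # alert the MicroBlaze of a failed PRR

wdt = WatchdogTimer(interval=3)
wdt.tick(); wdt.kick(); wdt.tick(); wdt.tick(); wdt.tick()
assert wdt.interrupt                 # PRM stopped kicking: hang detected
```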
If the PRM does not assert the watchdog-timer reset signal within the time interval, the RFT controller's interrupt generator asserts an interrupt signal to the MicroBlaze, alerting the MicroBlaze of a possibly failed PRR. Additionally, PRMs that perform internal fault detection must include an interrupt signal to notify the RFT controller of internally detected errors, which the RFT controller propagates to the MicroBlaze. PRMs using ABFT perform self-checking (or self-correction if the ABFT operation permits) on internally generated checksums and send an interrupt signal to the RFT controller when data errors are detected. Similarly, PRMs that use internal TMR can also signal detected and corrected errors to the RFT controller. PRMs without any specific fault-tolerance features can still use the RFT controller's watchdog timer, which allows the RFT controller to detect module hang-ups or other operational errors.

MicroBlaze Operation

In an RFT system, the MicroBlaze enables additional operational features, such as fault-tolerance methods that complement the fault-tolerance modes offered by the RFT controller. Additionally, the MicroBlaze maintains the system's fault-tolerance state (e.g., active PRMs, current PRRs' fault-tolerance modes, etc.) and orchestrates PRR reconfiguration and the fault-tolerance mode-switching process.

To protect the application state while reconfiguring PRRs, the MicroBlaze provides support for PRM checkpointing and rollback. A checkpoint, which the MicroBlaze stores in external memory, consists of the minimum set of application state information needed to restore the application's current state. In the event of a fault, an application can use the previous checkpoint to roll back to a known good state, instead of wasting execution time by restarting from an initial state. A PRM's state can be checkpointed if the PRM can be read and modified by the MicroBlaze. In an RFT system, the state of all PRMs should be checkpointed periodically in order to reduce wasted computation in the event of a fault-induced PRM restart. The stored checkpoints can then be used during fault-recovery procedures to improve system availability. In addition to handling PRM checkpointing, the MicroBlaze also handles fault recovery and reconfiguration procedures.
If the RFT controller's voting logic detects faults in the PRMs' outputs, the RFT controller records fault-status information about the faulty PRM (e.g., error location, time, etc.) in internal FT-status registers and sends an interrupt to the MicroBlaze, which initiates the reconfiguration procedure. The FT-status registers record fault-status information that may be used by the MicroBlaze to make fault-tolerance mode decisions or to provide system-log information to system operators. For single-module fault-tolerance modes, the MicroBlaze corrects the faulty PRM by reconfiguring the PRR with the PRM's original bitstream over the ICAP. For the DWC fault-tolerance mode, both of the associated PRMs must be reconfigured, since the faulty PRM cannot be identified. If checkpoints exist for the PRMs, the MicroBlaze initializes the new PRMs using these checkpoints. Alternatively, if the RFT system is operating in the TMR fault-tolerance mode, the MicroBlaze checkpoints one of the non-faulty PRMs while the faulty PRM is reconfigured. After the non-faulty PRM has been checkpointed, the RFT controller pauses the non-faulty PRMs using clock gating to keep the PRMs synchronized. Once the faulty PRM has been reconfigured, the MicroBlaze initializes the newly reconfigured PRM from the most recent checkpoint, and the RFT controller can re-enable the clocks for all three of the replicated PRMs to resume operation.

Finally, the MicroBlaze orchestrates the switching procedure between different RFT fault-tolerance modes. Fault-tolerance mode switching can be triggered by external events, by a priori knowledge of the operating environment, or by application-triggered events, and the fault-tolerance mode-switching procedure may vary on a per-system basis depending on the specific system performance and reliability requirements. The MicroBlaze can use information from the fault-status registers to make Pareto-optimal decisions about future fault-tolerance modes. Before fault-tolerance mode switching, the MicroBlaze ensures that sufficient PRRs are available for the new fault-tolerance mode.
Additional PRRs may be required when switching from a single-module fault-tolerance mode to a redundancy-based fault-tolerance mode, or freed when switching in the opposite direction. The MicroBlaze signals the RFT controller to change fault-tolerance modes by writing to the RFT controller's FT-Mode register. The MicroBlaze reconfigures the PRRs involved in the fault-tolerance mode switch via the ICAP with partial bitstreams for the appropriate PRM, or with a blank partial bitstream if the PRR is not required for the new fault-tolerance mode. When the reconfiguration process is complete, the MicroBlaze signals the RFT controller, by rewriting the FT-Mode register, to resume PRM operation.

Environment-Based Fault Mitigation

RFT fault-tolerance mode switching can be triggered by a priori knowledge of the operating environment, by application-triggered events, or by external events. While a priori knowledge and application-triggered events are convenient for modeling purposes, real-world systems leverage measurements from attached sensors to determine the system's current environmental status. Additionally, due to the unpredictability of space-weather conditions, such as solar-flare events, an RFT system must be able to respond dynamically to the changing environment.

In an RFT system, the current expected fault rate can be estimated either directly or indirectly. An external radiation sensor can be directly interfaced with the FPGA, allowing the MicroBlaze to track the current fault rate and predict future fault rates. Alternatively, the RFT system can indirectly determine fault rates by tracking the number of data and configuration faults detected during operation. Since an FPGA's fabric is composed of large SRAM arrays, these arrays can be used as makeshift radiation detectors. If a fault is detected in the FPGA configuration or data memory, either through readback during scrubbing or from the RFT controller's logic, the fault can be recorded or can be used to make decisions about which RFT fault-tolerance mode to use. One simple, rule-based method for choosing the RFT fault-tolerance mode is to use a sliding-window approach. For instance, a system may use the following rules:

1. If there were any faults in the past 5 minutes, use the DWC fault-tolerance mode.
2. If there were more than 5 faults in the past 5 minutes, use the TMR fault-tolerance mode.
3. Only transition from the TMR fault-tolerance mode to a lower-reliability mode after 5 minutes of fault-free operation.

The size of the sliding window and the RFT fault-tolerance mode choices are system- and mission-dependent. A larger window size provides a more conservative fault-tolerance strategy, while a smaller window size can more quickly adapt to spikes in the experienced fault rate. Implementing a rule with hysteresis, such as rule (3), can produce a more reliable strategy while requiring a smaller window size.

RFT Controller Resource and Performance Overheads

To quantify the RFT controller's resource overhead, we implemented an RFT-based system on a Virtex-4 FX60-based platform, similar to the SpaceCube. The design uses six PRRs connected to a MicroBlaze through a PLB-connected RFT controller and operates at 100 MHz. Each PRR contains 2,000 slices for user logic. Each PRR can operate in high-performance (HP) mode, in DWC mode with the PRR's neighboring PRR, or in TMR mode with two consecutive, neighboring PRRs. The resources (LUTs, FFs, and slices) required for the RFT controller and its constituent modules are detailed in Table 3-2. The final column of Table 3-2 shows the percentage of the Virtex-4 FX60 used by each module. The RFT controller logic, excluding the PLB controller that would also be required for non-RFT designs, requires approximately 900 slices. The largest modules within the RFT controller are the (optional) watchdog timers and the voting logic. Overall, the complete RFT controller uses approximately 5% of the total FPGA.

Since frequent PRR reconfiguration combined with lengthy PRR reconfiguration time can impose a performance overhead, we measured the reconfiguration time of a single PRR. The PRR reconfiguration time was measured using timing functions on a MicroBlaze with a PLB-attached ICAP controller running at 100 MHz. We measured the PRR reconfiguration time for a 2,000-slice PRR as approximately 15 ms.
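The sliding-window rules listed earlier in this section can be sketched as a small policy class. This is a behavioral sketch only: the class and method names are ours, while the window length, burst threshold, and mode names (HP, DWC, TMR) follow the example rules:

```python
from collections import deque

class FTModeSelector:
    """Sliding-window FT-mode policy: DWC on any recent fault, TMR on a
    burst of faults, and leave TMR only after a fault-free window."""
    def __init__(self, window=300.0, burst=5):
        self.window, self.burst = window, burst
        self.faults = deque()             # timestamps of detected faults
        self.mode = "HP"                  # high-performance, no redundancy

    def record_fault(self, t):
        self.faults.append(t)

    def select_mode(self, now):
        while self.faults and now - self.faults[0] > self.window:
            self.faults.popleft()         # drop faults outside the window
        n = len(self.faults)
        if n > self.burst:
            self.mode = "TMR"             # rule 2: burst of recent faults
        elif n > 0:
            if self.mode != "TMR":        # rule 3: hysteresis keeps TMR
                self.mode = "DWC"         # rule 1: any recent fault
        else:
            self.mode = "HP"              # fault-free window has elapsed
        return self.mode

sel = FTModeSelector()
sel.record_fault(10.0)
assert sel.select_mode(20.0) == "DWC"
for t in range(30, 90, 10):
    sel.record_fault(float(t))
assert sel.select_mode(90.0) == "TMR"
assert sel.select_mode(500.0) == "HP"    # 5 fault-free minutes: step down
```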
Even though the RFT mode-switching overhead is dominated by this PRR reconfiguration time, the mode-switching time incurs a negligible performance penalty due to the likely low frequency of mode switching. Additionally, the switching overhead only occurs when increasing the fault-tolerance redundancy (e.g., switching from DWC to TMR). When decreasing the fault-tolerance redundancy, the retained modules can continue running without interruption.

3.2 RFT Fault-Rate Model

In order to analyze RFT's reliability benefits, a suitable fault-rate model that incorporates varying fault rates, system performance capabilities, and system fault-tolerance modes is required. Traditional reliability analysis focuses on quantifying permanent hardware failures in systems with long lifetimes. Since most processors and FPGAs have sufficiently long lifetimes, permanent hardware failures due to long-term, end-of-life effects can be ignored. Therefore, reliability analysis for space systems focuses on short-term failure analysis by modeling SEU-induced computational failures. For FPGA-based systems, these failures can cause either single or multiple data errors through corrupted data or configuration memory.

FPGA upset rates for space systems are correlated with the magnetic-field strength at the system's current orbital position. For example, as a system passes into the Van Allen radiation belt, the trapped charged particles in the belt have a higher likelihood of interacting with the system. For systems in low-Earth orbit, only a portion of the orbit passes through the inner Van Allen belt, at a location known as the South Atlantic Anomaly (SAA). The SAA is the low-altitude area with a large number of trapped particles, located where the inner Van Allen belt comes closest to the Earth. Existing fault-rate modeling generally produces a single, average orbital upset rate; however, this average is not sufficient for RFT. In order for RFT to adapt the fault-tolerance mode, the fault-rate model must estimate the expected fault rates based on the instantaneous orbital position.
To account for the orbit-dependent, time-varying fault rates, data from several sources can be combined to form a more accurate estimate than a single average rate. Our fault-rate model combines orbital position and trajectory, magnetic-field strength, and cosmic-ray particle-interaction data to provide an accurate estimate of instantaneous fault rates. We use these fault-rate estimates as input into multiple system-level Markov models in order to calculate the reliability, availability, and performability of RFT systems.

Our RFT fault-rate model combines three existing models to estimate time-varying fault rates. Figure 3-3 illustrates the three models used, as well as the inputs and outputs of each model. To generate accurate fault estimates, a system's time-varying orbital position must be modeled. Orbital position can be estimated using the SGP4 model, a simplified general-perturbation modeling algorithm. NORAD, which is responsible for tracking space objects and space debris, developed SGP4 for tracking near-Earth satellites [Hoots and Roehrich 1980]. SGP4 accurately calculates a satellite's position given a set of orbital elements (apogee, perigee, inclination, etc.) collectively referred to as a two-line element (TLE). Given a system's TLE, the RFT fault-rate model uses SGP4 to generate the system's position information, in terms of latitude, longitude, and altitude, over the user-defined modeling time period.

Next, the RFT fault-rate model passes the SGP4 positioning information for each point along a specified orbit to the International Association of Geomagnetism and Aeronomy's (IAGA) International Geomagnetic Reference Field (IGRF) model [Maus et al. 2005], which models the Earth's magnetosphere. IGRF combines magnetosphere data collected from satellites and observatories around the world in order to create the most accurate and up-to-date model possible (the model is updated every five years). For a given orbital position, IGRF outputs a McIlwain L-parameter, representing the set of magnetic field lines that cross the Earth's magnetic equator at a number of Earth radii equal to the value of the L-parameter.
The inner Van Allen radiation belt corresponds to L-values between 1.5 and 2.5; the outer Van Allen belt corresponds to L-values between 4 and 6. In addition to identifying regions with trapped particles, the McIlwain L-parameter can be used to estimate the effect of geomagnetic shielding (cutoff rigidity) from galactic cosmic rays. The estimated L-parameters are then used, along with the outputs of CREME96, to estimate SEU rates.

The CREME96 model [Tylka et al. 1997], a publicly available SEU-estimation tool, generates fault-rate estimates and has been used extensively to predict heavy-ion and proton-induced SEU rates, as well as to estimate the expected total ionizing dose in modern electronics. CREME96 combines orbital parameters, space-system physical characteristics, and silicon device-process information to create a highly accurate SEU simulation. Traditionally, CREME96 generates an average fault rate for a particular orbit, produced by averaging hundreds of orbits together. However, CREME96 can also generate fault rates for orbital segments, which can be delimited using the McIlwain L-parameter. By running several simulations with very narrow L-parameter segments, CREME96 can obtain estimated fault rates for each segment. As the width of each segment decreases, the generated fault rates become more precise and continuous. The L-parameter outputs from the IGRF model are then mapped to the appropriate orbital segment and its associated fault rate.

The SGP4, IGRF, and CREME96 models collectively generate a time-varying fault-rate estimate over the course of a specified orbit. We implemented the RFT fault-rate model as a C++ program that connects the separate models together and passes data between them. SGP4's algorithms, along with C++/Java reference code, are publicly available via the Internet. A FORTRAN-based implementation of the IGRF algorithm for calculating position-based McIlwain L-parameters is publicly available from NASA Goddard [Macmillan and Maus 2010]. Fault-rate information can be generated from the CREME96 model through a web-based interface and stored for efficient offline use. Orbits described by TLEs can be visualized using open-source tools, such as JSatTrak [Gano 2010].
The RFT fault-rate model program accepts TLE data as input and generates time-varying fault-rate estimates as output. These fault-rate estimates are then used by the RFT performability model.
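A minimal sketch of the segment-lookup step in this model chain, assuming CREME96 has been run offline to produce one rate per narrow L-parameter bin (the bounds and rates below are invented placeholders, not real CREME96 output):

```python
import bisect

# Illustrative L-parameter bins and per-bin SEU rates (upsets/device-day).
# Bin i covers [SEGMENT_BOUNDS[i], SEGMENT_BOUNDS[i+1]); the inner-belt
# (1.5-2.5) and outer-belt (4-6) bins carry elevated placeholder rates.
SEGMENT_BOUNDS = [1.0, 1.5, 2.5, 4.0, 6.0]
SEGMENT_RATES = [2.0, 40.0, 8.0, 55.0, 12.0]

def fault_rate_for_L(l_value):
    """Map one IGRF-derived McIlwain L-parameter to its bin's SEU rate."""
    i = bisect.bisect_right(SEGMENT_BOUNDS, l_value) - 1
    i = max(0, min(i, len(SEGMENT_RATES) - 1))  # clamp below/above the bins
    return SEGMENT_RATES[i]

def fault_rate_profile(l_values):
    """Time-varying fault-rate estimate for a sequence of orbit positions."""
    return [fault_rate_for_L(l) for l in l_values]
```

In the actual flow, the `l_values` sequence comes from SGP4 positions fed through IGRF, and the per-bin rates come from the offline CREME96 runs.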

3.3 RFT Performability Model

System reliability is the probability that a system is operating without faults after a specified time period. Assuming exponentially distributed random faults at rate λ, the system reliability is traditionally defined as:

R(t) = e^(−λt) (3-1)

Mean-time-to-failure (MTTF) and mean-time-to-repair (MTTR) are the average amounts of time before a system encounters a failure (or repair) event. For a fault rate λ (or repair rate μ), MTTF (or MTTR) is defined as:

MTTF = ∫₀^∞ R_λ(t) dt,    MTTR = ∫₀^∞ R_μ(t) dt (3-2)

System availability, which is similar to system reliability, estimates the long-term, steady-state probability that the system is operating correctly, and is defined as:

A = MTTF / (MTTF + MTTR) (3-3)

System unavailability is the complement of availability and is often used for convenience when discussing systems with very high availability. Unavailability is defined as:

UA = 1 − A = MTTR / (MTTF + MTTR) (3-4)

Space-system reliability and availability can be accurately modeled using Markov models. A Markov model is composed of states and transition rates. A state represents the current operating state of the system, and the transition rates represent transitions from an operating state to a failure state (i.e., failure rates) or from a failure state to an operating state (i.e., repair rates). The Markov model can be transformed into a series of equations, which can be solved or approximated numerically using tools such as SHARPE, an open-source fault-modeling tool [Sahner and Trivedi 1987], to determine the probability of each state. System reliability and availability can be

directly determined from the calculated state probabilities. For Markov models, the instantaneous availability of repairable systems measures the probability of being in an available vs. failed state at a given point in time. These types of models are frequently used to estimate the effects of TMR, scrubbing, and other fault-tolerance methods in FPGAs and other electronics [Dobias et al. 2005; Garvie and Thompson 2004; Pratt et al. 2007]. Markov reward models, a type of weighted Markov model, can be used to extend the concept of system availability to measure the performability of adaptable systems [Ciardo et al. 1990; Meyer 1982]. Performability is a metric that combines system availability with the amount of work produced by the system, giving a measure of total work performed. Performability is especially useful for gracefully degradable systems or other systems that have changing characteristics over time. Assuming that X(t) is a semi-Markov process with state space S and is continuous over time t > 0, the instantaneous performability is defined by

Performability(t) = Σ_{a∈S} Perf(a) · P{X(t) = a} (3-5)

where Perf(a) is the system performance in state a. System performance can be defined using any desired performance metric (e.g., throughput or execution time), and performability is measured in the same units. In this context, instantaneous availability can be viewed as the special case where Perf(a) = 1 in available states and Perf(a) = 0 otherwise. For reconfigurable FPGA systems where the system configuration changes over time (e.g., our RFT architecture), the system must be modeled as a phased-mission system. A phased-mission system is described using a set of unique models, one for each phase of the mission. The states at the end of a given phase's model map to the states of the following phase's model during phase transitions, and the phase duration can be modeled as either probabilistic or deterministic [Alam et al. 2006; Kim and Park 1994].
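Equations 3-1 through 3-5 translate directly into code. The sketch below computes each metric; the state probabilities and rewards are illustrative values, not outputs of the models in this chapter:

```python
import math

# Eqs. 3-1 to 3-4: exponential reliability and steady-state availability.
# For R(t) = exp(-lam*t), the MTTF integral in Eq. 3-2 evaluates to 1/lam
# (and likewise MTTR = 1/mu).
def reliability(lam, t):
    return math.exp(-lam * t)

def availability(mttf, mttr):
    return mttf / (mttf + mttr)

# Eq. 3-5: performability is the reward-weighted sum of state
# probabilities P{X(t) = a}.
def performability(state_probs, rewards):
    return sum(rewards[a] * p for a, p in state_probs.items())

# Illustrative six-PRR example: state = number of operational PRRs,
# reward = that count (HP mode), with a small probability of total failure.
probs = {6: 0.90, 5: 0.08, 0: 0.02}
perf = performability(probs, {6: 6, 5: 5, 0: 0})   # 5.8 work units
# Availability is the special case with reward 1 in available states:
avail = performability(probs, {6: 1, 5: 1, 0: 0})  # 0.98
```

The same `performability` call evaluates any Markov reward model once a solver has produced the state probabilities.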

We use the RFT fault-rate model's generated fault-rate estimates to drive multiple system-level Markov models in order to calculate the reliability, availability, and performability of RFT systems. To incorporate varying fault rates (due to orbital position) and varying system topologies (due to RFT), we leverage a phased-mission Markov approach. We model the RFT system as a collection of individual phases, each consisting of a period of time in which the fault-tolerance mode, failure rates, and repair rates are constant. The phase lengths are both application- and orbit-dependent. At the end of each phase, the pre-transition state probabilities are mapped onto the initial probabilities of the post-transition Markov model. Figure 3-4 illustrates an example high-level model transitioning from TMR to DWC at time t1, and then from DWC back to TMR at time t2. Fault rates (λ) and repair rates (μ) are represented as directed graph edges between states. At each phase transition (denoted by the dashed vertical lines), state probabilities are re-mapped (dashed arrows). When re-mapping state probabilities from TMR to DWC, two TMR states are merged into a single operational state. When re-mapping from DWC to TMR, the single operational DWC state maps to the most similar TMR state. Each fault-tolerance mode has an associated Markov model. Each state of the Markov model corresponds to the number of operational devices in a system (e.g., PRRs in an FPGA-based SoC). Transitions between states occur when a device changes state from operational to failed, or vice versa. FPGA upset rates are estimated using the RFT fault-rate model described in Section 3.2. The system repair rates are based on the designer-specified scrub and checkpointing rates and on the system and PRR reconfiguration times, which can be obtained experimentally. Transitions between fault-tolerance modes are mission-dependent.
State-mapping functions ensure a continuous availability function, although performability may contain discontinuities at phase transitions.
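The re-mapping step at a phase boundary can be sketched as follows. The state names and the TMR-to-DWC mapping are illustrative simplifications of the models in Figure 3-4: merged states sum their probabilities, so total probability (and hence availability) stays continuous across the transition:

```python
# Illustrative TMR-to-DWC state mapping: both the fully operational and
# the masked-fault TMR states merge into the single operational DWC state.
TMR_TO_DWC = {
    "tmr_3_up": "dwc_up",    # all three modules operational
    "tmr_2_up": "dwc_up",    # one fault masked by voting
    "tmr_down": "dwc_down",  # failed
}

def remap(pre_probs, mapping):
    """Sum pre-transition state probabilities into post-transition states."""
    post = {}
    for state, p in pre_probs.items():
        post[mapping[state]] = post.get(mapping[state], 0.0) + p
    return post

pre = {"tmr_3_up": 0.7, "tmr_2_up": 0.2, "tmr_down": 0.1}
dwc_probs = remap(pre, TMR_TO_DWC)
# dwc_probs["dwc_up"] is about 0.9, and probabilities still sum to 1,
# which is what keeps the availability curve continuous at the boundary.
```

The resulting dictionary supplies the initial probabilities for the next phase's Markov model.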

Figure 3-5 shows Markov model representations for each fault-tolerance mode in a system with six PRRs. Figures 3-5(a) and 3-5(b) represent systems where each PRR operates in the HP or ABFT fault-tolerance mode, respectively. Figure 3-5(c) represents a system using three independent pairs of PRMs operating in the DWC fault-tolerance mode. Figure 3-5(d) represents a system using two independent sets of three PRMs operating in the TMR fault-tolerance mode. Solid circles represent the Markov model's available operating states, and dashed circles represent failed operating states. Using the definition of performability from Eq. 3-5, each state in a Markov reward model is assigned a performance throughput value. For this analysis, system performance is normalized to the work performed by a single PRR. The performance throughput value of each state is represented in the Markov model by the value in the rectangles. For example, six independent PRRs running concurrently in the HP fault-tolerance mode have a performance throughput value of 6, while a system using six PRRs as two independent sets of three operating in the TMR fault-tolerance mode has a performance throughput value of 2. The ABFT Markov model demonstrates a generic reliability model for modules that contain some form of internal fault tolerance and are capable of self-detecting data errors. In particular, ABFT provides a low-overhead method for detecting errors in certain linear algebra operations. While hardware-implemented ABFT may not achieve 100% fault coverage, it still improves on the HP mode, which cannot detect when corrupt data is returned. In the Markov model for the single-module ABFT mode, we estimate the performance of a single module as 80% of the default, unprotected module due to the performance overhead associated with generating and comparing checksums for fault detection [Acree et al. 1993].
Additionally, ABFT fault rates are modeled as 20% higher than those of unprotected modules due to the increased module size from the ABFT logic. The repair rate for the ABFT model uses the system's scrubbing rate, supplying a

worst-case repair rate for cases when the ABFT logic does not detect (or even introduces) an error, relying on external scrubbing and configuration readback to detect errors. Although these numbers are application- and implementation-specific, they represent overhead costs that must be considered. The ABFT model can also be used to model other modules with user-implemented fault-tolerance techniques. For a given phased-mission Markov model, a TLE is used to determine the expected fault rates using the RFT fault-rate model in Section 3.2. These fault rates, along with a description of the mission-specific criteria for using each fault-tolerance mode, are used to split a full space mission into distinct phases for Markov modeling. RFT phases are periods of time with constant fault rates, repair rates, and fault-tolerance modes. For each phase, the previous phase's state probabilities are mapped onto the new phase's Markov model as initial probabilities. SHARPE processes the new model to numerically solve the state probabilities and reliability metrics for the current phase. SHARPE's results provide the initial probabilities for the next phase. This process is repeated for each phase over the entire mission's duration, and SHARPE's results are aggregated to produce overall reliability results.

3.4 Results and Analysis

In this section, we present one validation case study and two orbital case studies to evaluate the potential reliability and performance benefits of our RFT framework for space systems. The validation case study uses FPGA fault injection to estimate fault rates and error coverage, which can then be used by the other case studies. The orbital case studies represent FPGA-based space systems operating in two common orbits, with multiple performance and reliability requirements. The first case study represents a space system operating in low-Earth orbit (LEO), modeled on the Earth Observing-1 (EO-1) satellite.
Space systems in LEO experience relatively low radiation. The second case study represents a space system operating in a highly elliptical orbit (HEO), which is a much

harsher radiation environment. Each case study compares multiple adaptive fault-tolerance strategies to a traditional static TMR strategy.

Validation Case Study

In order to validate our reliability models, faults must be injected into an executing RFT system. In this section, we present FPGA fault-injection results gathered using the Simple, Portable Fault Injector (SPFI) [Cieslewski et al. 2010]. SPFI performs fault injection using both full and partial reconfiguration, which reduces the time required to modify configuration memory and improves the speed of fault injection. We validate our reliability models by correlating SPFI's results with analytical Markov model results. For the validation case study, we implemented a simplified RFT-based system on a Xilinx ML505 FPGA development platform. The RFT-based system had a 3-PRR RFT controller that supported HP mode and TMR voting and had watchdog-timer functionality. Each PRR contained a matrix multiplication (MM) PRM, and a MicroBlaze processor in the static region streamed data from a UART to an MM PRM and streamed results back to the UART. The SPFI fault-injection tool enabled individual system components (e.g., PRMs, RFT controller, MicroBlaze) to be tested independently without modifying the entire system. Table 3-3 shows the RFT system's fault-injection results. For each system component, Table 3-3 indicates the number of injections performed and the number of data errors and system hangs detected. The fault vulnerability for each component is scaled to the FPGA's total number of configuration bits to estimate the component's design vulnerability factor (DVF). A component's (or device's) DVF represents the percentage of bits that are vulnerable to faults and can result in observable errors. Most Xilinx FPGA designs have a DVF that ranges from 1% to 10% [Xilinx 2010b] due to the large amount of configuration memory devoted to routing.
The DVF for each component is calculated by measuring the component's fault rate, estimating the number of vulnerable bits by scaling the fault rate to the size of the area occupied by the component, and dividing the number of vulnerable bits by the total number

of FPGA configuration bits. For a single MM PRM with the RFT controller in HP mode, only 1.6% of faults injected into the PRR were found to cause observable errors in the output. The approximately 41,497 vulnerable bits in each PRR, which occupies roughly one-eighth of the FPGA, result in a DVF_MM of 0.197%. In this case, the majority of the PRR is unused, resulting in very few vulnerable bits and a low DVF. Faults injected into the RFT controller resulted in a DVF_RFT of 0.022%. The MicroBlaze was not protected using TMR or other design techniques and had a DVF_MB of 0.342%. Based on these component fault-injection results, the total FPGA DVF is estimated to be 0.955% (3·DVF_MM + DVF_RFT + DVF_MB). Figure 3-6 shows Markov model representations for each fault-tolerance mode in a system with three PRRs. Figures 3-6(a) and 3-6(b) represent systems where each PRR operates in the HP or the TMR fault-tolerance mode, respectively. These models are similar to the models presented in Section 3.3, without the additional states for performability, and with added state transitions to account for fault coverage. In a TMR system, non-covered faults, those the TMR protection fails to mask because the design is not fully protected, cause the system to immediately enter the failed state. Initial fault-injection testing revealed that our system had two possible fault scenarios. In the first scenario, a fault could cause the system to remain operational but produce erroneous data; this state was recoverable using periodic scrubbing. In the second scenario, a fault could cause the system to hang until a full reconfiguration was performed. We expanded the Markov models to include these behaviors as additional states. Figure 3-6(a) shows the HP Markov model with two unavailable states that account for the two fault scenarios.
The probabilities of transitioning to Faulty Data or System Hang were determined from the component fault-injection testing. The DVF_FPGA was estimated as the sum over the components tested.
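Using the numbers reported above, the DVF arithmetic is short; the helper names here are illustrative:

```python
# DVF from fault-injection fractions: a component covering
# component_fraction of the device, with observable_fraction of its
# injected faults producing errors, contributes their product to the
# whole-device DVF. For the MM PRM: 1.6% observable over 1/8 of the
# device gives 0.2%, close to the reported 0.197%.
def dvf_from_fractions(observable_fraction, component_fraction):
    return observable_fraction * component_fraction

dvf_mm = dvf_from_fractions(0.016, 1 / 8)  # approx. 0.002 (0.2%)

# Total FPGA DVF as the sum over components (3 MM PRRs + RFT + MicroBlaze),
# using the measured per-component DVFs from the text.
total_fpga_dvf = 3 * 0.00197 + 0.00022 + 0.00342  # 0.00955 (0.955%)
```

The slight gap between 0.2% and the measured 0.197% reflects the rounding of the "one-eighth" area estimate.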

A similar approach was taken for the TMR Markov model shown in Figure 3-6(b). The additional degraded state models faults that have been masked by the TMR protection provided by the RFT controller. The probability of transitioning to the degraded state is given by the DVF_MM from the previous testing. The DVF_sys term represents faults in the RFT controller (DVF_RFT) and the MicroBlaze (DVF_MB). From fault-injection testing, we estimate DVF_sys to be approximately 0.364%. By assigning fault rates, repair rates, and coverage to the Markov model, we can calculate the system availability. Table 3-4 shows the fault and repair rates used in this analysis. Using a fault rate of 1 fault per 10 seconds and a scrub rate of 1 scrub per 5 seconds, the RFT system has a 98.82% availability in HP mode and 99.31% availability in TMR mode. When the ratio of fault rate to scrub rate is increased, the benefits of TMR become more pronounced. Using a fault rate of 1 fault per 2 seconds and a scrub rate of 1 scrub per 10 seconds, the RFT system has a 92.25% availability in HP mode and 95.32% availability in TMR mode. These high availabilities, even in HP mode, are due to frequent scrubbing and the very low DVF of the FPGA design. Finally, we validate the Markov model results using fault injection. In a continuously running RFT system, faults are injected at a specified rate with exponentially distributed inter-arrival times (a Poisson process), matching the error model used in the Markov model. Scrubbing, using partial reconfiguration, occurs at user-defined periodic intervals. Full reconfiguration occurs at the next scrubbing cycle after a PRM error has been detected by the RFT controller. Full reconfigurations also occur if the external testing program detects that the FPGA system has entered the System Hang state. Availability is experimentally determined as the ratio of the time the system operates correctly to the total experiment run time.
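As a much-simplified sketch of this validation setup, the following pairs a single two-state repairable Markov model with a Poisson fault-injection simulation in which each fault is repaired at the next periodic scrub. The real models additionally include DVF-based coverage, a hang state, and full reconfiguration:

```python
import random

# Analytic steady-state availability of a two-state repairable model
# with fault rate lam and repair rate mu (cf. Eq. 3-3 with MTTF = 1/lam,
# MTTR = 1/mu).
def steady_state_availability(lam, mu):
    return mu / (lam + mu)

# Simulation: faults arrive with exponential inter-arrival times; the
# system is down from each fault until the next periodic scrub completes.
def simulated_availability(fault_period, scrub_period, n_faults, seed=1):
    rng = random.Random(seed)
    t = down = 0.0
    for _ in range(n_faults):
        t += rng.expovariate(1.0 / fault_period)        # next fault arrival
        next_scrub = (t // scrub_period + 1) * scrub_period
        down += next_scrub - t                          # down until scrubbed
        t = next_scrub
    return 1.0 - down / t
```

With periodic scrubbing, the mean downtime per fault is roughly half the scrub period, so the simulated availability approaches fault_period / (fault_period + scrub_period / 2).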
For each run, 10,000 faults were randomly injected into a running system. Table 3-4 shows the results of the availability experiment and the relative error from the analytical model. At low fault rates, the HP and TMR modes both provide approximately

99% availability. With high fault rates, the system had an availability of 92.59% in HP mode and 93.89% in TMR mode. The Markov model availability methodology provides a simple and effective method for determining the effects of fault and repair rates on system availability without exhaustive testing. In general, the HP models predicted slightly lower availability than was observed during experimental testing, while the TMR models overestimated the availability of the system. All availability results were within 1.5% of the Markov models' predictions. The Markov model accuracy can be improved by providing more accurate fault-injection results, and the experimentally obtained availability values can be improved by increasing the length of the testing period. Fault-injection testing did highlight implementation issues that must be handled in any high-reliability design. The use of TMR had a lower-than-expected benefit due to the unprotected MicroBlaze processor. Since DVF_MB was larger than DVF_MM, the availability of the system was dominated by the MicroBlaze's availability. For high reliability, the MicroBlaze must be protected with TMR or replaced by an alternative fault-tolerant processor (e.g., FT-LEON3).

Orbital Case Studies

For each orbital case study, the system under test is an FPGA system-on-chip implementation of the RFT hardware architecture described in Section 3.1. In order to calculate device vulnerability, the parameters of the Xilinx Virtex-4 FX60 are used as inputs to the RFT fault-rate model described in Section 3.2. The radiation-susceptibility parameters of the Virtex-4 device family are obtained from the Xilinx Radiation Test Consortium's (XRTC) published results [Swift et al. 2008]. The generated fault rate is linearly scaled from a full device to the size of a PRR to produce PRR fault rates.
The RFT controller is connected to six PRRs, allowing for several combinations of the TMR, DWC, and ABFT fault-tolerance modes discussed in Section 3.3. The performability

model from Section 3.3 is used to evaluate the effectiveness of each fault-tolerance strategy.

Low-Earth orbit case study

Space systems in LEO are commonly used for Earth-observing science applications, such as HSI or SAR. Both HSI and SAR produce large datasets that can be significantly reduced through on-board processing. The processing algorithms decompose into basic kernels that can be parallelized and implemented on an FPGA system for high performance and power efficiency. For example, HSI can be decomposed into a sequence of matrix multiplications and matrix inversions, and SAR can be decomposed into vector multiplications and FFTs. Since these mathematical operations are linear, ABFT can be used to protect the computation results from SEUs. Applications that cannot be protected with ABFT must use TMR or DWC for fault tolerance. The TLE used to generate the fault rates for this LEO case study is from the EO-1 satellite. The EO-1 orbit is circular at an altitude of 700 km, with a near-polar inclination and an orbital period of 98 minutes. Figure 3-7(a) shows the orbital track of EO-1; the shaded circle represents EO-1's field of view. Figure 3-7(b) shows the estimated number of upsets per hour that occur in the EO-1 orbit over several orbital periods. The average fault rate of the Virtex-4 FX60 in EO-1's orbit is 16.5 faults per device-day (combined configuration-memory and BRAM vulnerability). Each local maximum occurs when the satellite is closest to the Earth's magnetic poles. Fault rates in EO-1's orbit are low because the orbit lies below the Van Allen belts and fully within the Earth's magnetosphere, which deflects a large amount of radiation. We examine the availability and performability of the TMR, DWC, and ABFT fault-tolerance modes in LEO, as well as three adaptive fault-tolerance strategies to maximize application performability: the 10% two-mode, 50% two-mode, and three-mode adaptive strategies.
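The checksum-based ABFT mentioned above for linear operations can be illustrated in software with the classic row/column-checksum scheme for matrix multiplication; this is a simplified sketch, not the hardware implementation evaluated in the next chapter:

```python
# Row/column-checksum ABFT for C = A*B: the column sums of C must equal
# (column-checksum row of A) * B, and the row sums of C must equal
# A * (row-checksum column of B). A single corrupted element of C breaks
# at least one of these identities, making the error detectable.
def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def abft_check(A, B, C, tol=1e-9):
    """Return True if C passes both checksum tests for A*B."""
    col_sums_A = [sum(row[k] for row in A) for k in range(len(A[0]))]
    row_sums_B = [sum(B[k]) for k in range(len(B))]
    expect_col = [sum(col_sums_A[k] * B[k][j] for k in range(len(B)))
                  for j in range(len(B[0]))]
    expect_row = [sum(A[i][k] * row_sums_B[k] for k in range(len(A[0])))
                  for i in range(len(A))]
    ok_col = all(abs(sum(C[i][j] for i in range(len(C))) - expect_col[j]) <= tol
                 for j in range(len(C[0])))
    ok_row = all(abs(sum(C[i]) - expect_row[i]) <= tol for i in range(len(C)))
    return ok_col and ok_row
```

Because only the checksums are recomputed, the detection cost grows with the matrix perimeter rather than with the full multiplication, which is the source of ABFT's low overhead.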
For the adaptive fault-tolerance strategies, fault-tolerance mode switching is determined by comparing the current upset rate with a fault-rate threshold. The 10%

two-mode adaptive strategy uses the ABFT fault-tolerance mode when the upset rate is in the lowest 10% of the expected fault rates and the TMR fault-tolerance mode otherwise. The 50% two-mode adaptive strategy uses the same approach with a higher fault-rate threshold. The three-mode adaptive strategy uses ABFT when the upset rate is in the lowest 10% of the expected fault rates, TMR when the upset rate is in the highest 50% of the expected fault rates, and DWC otherwise. In all modes, the system performs scrubbing to ensure that configuration-memory errors are removed from the system. For this LEO case study, the system uses a 60-second scrub cycle, which is also used as the system repair rate for the Markov models. Figure 3-8(a) shows the availability of the LEO system while statically using the four fault-tolerance modes described in Figure 3-5. While the static HP strategy's availability quickly declines (due to the lack of any fault-tolerance mechanisms), the static TMR, DWC, and ABFT strategies all maintain availability above 95%. The dynamically changing availability is directly related to the current fault rate; however, since the repair rate of the system (through scrubbing) is much greater than the expected fault rates, most configuration-memory faults are rapidly mitigated. The system availability while using the three adaptive fault-tolerance strategies is shown in Figure 3-8(b). Each adaptive strategy improves average availability over the static ABFT strategy, and the adaptive strategies can raise the minimum availability above 99.5%. Table 3-5 presents the availability results for each fault-tolerance strategy in terms of unavailability, i.e., the probability of a system failure. The 10% two-mode adaptive strategy reduces average unavailability by 88% compared to the static ABFT strategy.
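The threshold-based mode selection used by these strategies can be sketched as follows; the percentile computation is an illustrative simplification:

```python
# Three-mode strategy from the text: ABFT below a low percentile of the
# orbit's expected fault-rate distribution, TMR above a high percentile,
# DWC in between. A two-mode ABFT/TMR strategy is the special case
# low_pct == high_pct.
def select_ft_mode(rate, expected_rates, low_pct=0.10, high_pct=0.50):
    ranked = sorted(expected_rates)
    low = ranked[int(low_pct * (len(ranked) - 1))]
    high = ranked[int(high_pct * (len(ranked) - 1))]
    if rate <= low:
        return "ABFT"
    if rate >= high:
        return "TMR"
    return "DWC"
```

In an RFT system, `expected_rates` would come from the fault-rate model's pre-computed orbit profile, and the returned mode would drive the RFT controller's PRR reconfiguration.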
For extremely high availability, static TMR is required, as the maximum unavailability of the adaptive strategies is more than 100 times higher than TMR's. The system performability of the static and adaptive fault-tolerance strategies in LEO is also shown in Table 3-5. Due to the low overall upset rates and the good availability of the static ABFT strategy, the static ABFT strategy achieves the highest

performability. The 50% two-mode adaptive strategy achieves an average performability throughput of 4.01, a 100% improvement over static TMR, while improving unavailability over the static ABFT strategy by 73%. The 10% two-mode strategy has lower average performability while maintaining better system availability. The three-mode strategy has better performability than the static DWC mode and better availability than the static DWC or ABFT modes, but is outperformed by the 50% two-mode strategy. We note that the fault-rate threshold of the two-mode strategies can be tuned to trade availability against performability for a space system. Figure 3-9 illustrates the effect of changing the threshold of the two-mode adaptive strategy for the LEO case study. As the threshold is raised, more time is spent in the ABFT mode, which lowers system availability while increasing performability. For the LEO case study, most of the performance gains from using ABFT can be obtained with a low threshold value because much of the orbit falls under this threshold. With a 10% threshold, an RFT system spends approximately equal amounts of time in the ABFT and TMR modes. Raising the adaptive threshold above 50% yields limited performance gains at the expense of decreased availability. Further analysis of these thresholds in mode-switching strategies can enable optimization toward Pareto-optimal goals.

Highly-elliptical orbit case study

The HEO is a common type of orbit used mostly by communication satellites. From the ground, satellites traveling in an HEO can appear stationary in the sky for long periods of time. HEOs also offer visibility of the Earth's polar regions, which most geosynchronous satellites do not. The HEO used for this case study is a Molniya orbit, named for the communication satellites that first used it. A TLE for a Molniya-1 satellite was used to generate fault rates for the HEO case study.
This orbit has a perigee of 1,100 km, an apogee of 39,000 km, and a 63.4° inclination, with an orbital period of 12 hours. The average amount of radiation throughout the orbit is much higher

than in the LEO case study, and much larger amounts of radiation are encountered when the satellite passes through the Van Allen belts. The average fault rate in the Molniya-1 orbit is 62 faults per device-day. For most of the orbit, the fault rate averages 7 faults per device-day, but the large fault-rate peaks that occur near perigee increase the overall rate. Figure 3-10(a) illustrates the Molniya-1 orbit, and Figure 3-10(b) shows the estimated number of upsets per hour that an FPGA might experience in an HEO. Space systems outside of the Earth's magnetosphere, either in geosynchronous orbit or in interplanetary space, experience relatively constant fault rates that vary with the occurrence of solar flares and other space-weather conditions. Solar flares can send a wave of high-energy particles into space, causing a brief period of extremely high fault rates. These fault-rate spikes, while different in origin, look similar to the fault-rate peaks that occur in this case study, so the analysis used here can also be used to estimate RFT system performance under different space-weather conditions. We evaluate the availability and performability of the TMR, DWC, and ABFT fault-tolerance modes in an HEO. To maximize the performability of applications in an HEO, we also examine two adaptive fault-tolerance strategies. The ABFT/TMR adaptive strategy uses the ABFT fault-tolerance mode when upset rates are in the lowest 5% of expected fault rates and the TMR fault-tolerance mode otherwise. The DWC/TMR adaptive strategy uses the same fault-rate thresholds as ABFT/TMR but switches between the DWC and TMR modes. In each mode, the system performs scrubbing to ensure that configuration-memory errors are removed from the system. For the HEO case study, the system uses a 10-second scrub cycle to account for the increased average fault rates experienced in an HEO; this scrub rate is used as the system repair rate for the Markov models.
Figures 3-11(a) and 3-11(b) show the availability of the HEO case-study system while using three static fault-tolerance strategies and two adaptive strategies. The static TMR strategy maintains an average availability of 99.93%, but the availability

drops as low as 95.1% while passing through the peak fault-rate periods. While using the static DWC and ABFT strategies, the availability drops significantly during the peak fault-rate periods, making these strategies unsuitable for systems that must maintain continuous operation (due to the extremely high upset rates, many systems shut down during peak fault-rate periods). However, outside of the peak fault-rate periods, the minimum availability of the static DWC (99.8%) and ABFT (99.3%) strategies is high enough to be tolerable for many applications. The two adaptive strategies use the DWC and ABFT fault-tolerance modes during low fault-rate periods and the TMR fault-tolerance mode during the peak fault-rate periods. Using the adaptive strategies, the availability never falls below the TMR strategy's minimum availability of 95.1% during peak fault-rate periods, while remaining higher at other times. (Note the change of scale in Figure 3-11(b).) Table 3-6 shows the unavailability and performability of the fault-tolerance strategies described in this section. The DWC/TMR adaptive strategy reduces average unavailability by 80% compared to the static DWC strategy. The ABFT/TMR adaptive strategy reduces average unavailability by 85% compared to the static ABFT strategy. The system performability of the static and adaptive fault-tolerance strategies in an HEO is also shown in Table 3-6. While the TMR strategy exhibits the lowest performability, it also has the lowest variation in performability and the highest availability, allowing for predictable performance levels. The performability of the static ABFT and DWC strategies is significantly reduced in the peak fault-rate periods but quickly returns to acceptable levels afterward. The maximum unavailability for each of the adaptive strategies occurs during the peak fault-rate periods.
Since the RFT system uses the TMR fault-tolerance mode during these segments, the adaptive strategies have the same maximum unavailability as the static TMR strategy. The DWC/TMR adaptive strategy increases performability over the TMR strategy by 46%. The ABFT/TMR adaptive strategy increases average performability

over the TMR strategy by 128%, only 4% below the static ABFT strategy, while significantly improving unavailability over the ABFT strategy.

3.5 Conclusions

In this work, we have presented a novel and comprehensive framework for reconfigurable fault tolerance capable of creating and modeling FPGA-based reconfigurable architectures that enable a system to self-adapt its fault-tolerance strategy in accordance with dynamically varying fault rates. The PR-based RFT hardware architecture enables several redundancy-based fault-tolerance modes (e.g., DWC, TMR) and additional fault-tolerance features (e.g., watchdog timers, ABFT, checkpointing and rollback), as well as a mechanism for dynamically switching between modes. The combination of these fault-tolerance features enables the use of COTS SRAM-based FPGAs in harsh environments. Future work will automate and optimize the creation of RFT controllers for specific system configurations in order to increase developer productivity, facilitate adoption, and reduce system overhead. In addition to the hardware architecture, we have demonstrated a fault-rate model for RFT that accurately estimates upset rates and captures time-varying radiation effects for arbitrary satellite orbits using a collection of existing, publicly available tools and models. Our model provides an important characterization of fault rates for space systems over the course of a mission that cannot be captured by average fault rates. The HEO case study demonstrated the large range of fault rates experienced in certain elliptical orbits, and the potential for performance improvements when using RFT. Using the results from the fault-rate model, our Markov-model-based RFT performability model was used to demonstrate the benefits of an adaptive fault-tolerance system architecture.
The performability model was validated using FPGA fault injection on an experimental RFT system, and multiple static and adaptive fault-tolerance strategies for space systems in dynamically changing environments were evaluated. The RFT performability model demonstrated that while TMR provides

a lower bound on availability, less reliable, low-overhead methods can be used during low-fault-rate periods to improve performance. We evaluated two case-study orbits and observed that adaptive fault-tolerance strategies are able to improve unavailability by 85% over static ABFT and performability by 128% over traditional, static TMR fault tolerance. This additional performance can lead to more capable and power-efficient FPGA-based onboard processing systems in the future.

Table 3-1. RFT fault-tolerance modes.
Fault-tolerance mode                          Fault-tolerance type   PRRs required
Triple Modular Redundancy (TMR)               Redundancy             3
Duplication with Compare (DWC)                Redundancy             2
High-Performance (HP) - no fault protection   Single-module          1
Algorithm-Based Fault Tolerance (ABFT)        Single-module          1
Internal TMR                                  Single-module          1

Table 3-2. RFT controller resource usage.
Module name       LUTs used   FFs used   Slices used   FPGA utilization
PLB Controller
Address Decoder
RFT Registers
Watchdog Timers
Voting Logic
Output Mux
Total

Table 3-3. Fault-injection results for RFT components.
Component              Faults injected   Data errors   System hangs   Vulnerable bits (est.)   DVF (%)
Matrix Multiply (MM)   100,000
RFT Controller (RFT)   100,000
MicroBlaze (MB)        100,000
Full FPGA                                                                                      0.955%

Table 3-4. RFT Markov model validation.
Fault period   Repair period   FT-mode   Markov model   Experimental   Model
(time/fault)   (time/scrub)              availability   availability   error
10s            5s              HP        98.82%         98.99%         0.2%
10s            5s              TMR       99.31%         99.11%         0.2%
2s             10s             HP        92.25%         92.59%         0.4%
2s             10s             TMR       95.32%         93.89%         1.5%

Table 3-5. Unavailability and performability for LEO case study.
Strategy       Average          Maximum          Average          Minimum
               unavailability   unavailability   performability   performability
TMR
DWC
ABFT
2-Mode (10%)
2-Mode (50%)
3-Mode

Table 3-6. Unavailability and performability for HEO case study.
Strategy    Average          Maximum          Average          Minimum
            unavailability   unavailability   performability   performability
TMR
DWC
ABFT
DWC/TMR
ABFT/TMR

Figure 3-1. System-on-chip architecture with RFT controller.

Figure 3-2. RFT controller PLB-to-PRR interface.

Figure 3-3. RFT fault-rate model.
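The pipeline in Figure 3-3 reduces, at each time step, to a table lookup: the TLE is propagated (via SGP4) to a position, the position is converted (via IGRF) to a McIlwain L-parameter, and the L-parameter is mapped to a pre-computed, CREME96-derived fault rate. The final mapping stage can be sketched as below; the bin edges and rates here are hypothetical placeholders, not the values used in the dissertation:

```python
# Sketch of the L-parameter -> fault-rate mapping stage of the RFT
# fault-rate model (Figure 3-3).  The bin edges and per-bin rates are
# hypothetical; the real model derives them from segmented CREME96 runs
# for the actual device, process, and orbit.
import bisect

# Upper edges of McIlwain L-parameter bins (dimensionless L-shell values).
L_BIN_EDGES = [1.5, 2.5, 3.5, 5.0]
# Hypothetical upset rate (faults per device-day) for each bin; the last
# entry covers all L values above the final edge.
RATE_PER_BIN = [0.5, 40.0, 8.0, 1.0, 0.2]

def fault_rate(l_param: float) -> float:
    """Map a McIlwain L-parameter to a segmented fault-rate estimate."""
    return RATE_PER_BIN[bisect.bisect_left(L_BIN_EDGES, l_param)]

# A pass through a high-flux region (here the hypothetical 1.5-2.5 bin)
# produces a pronounced spike relative to the surrounding segments.
orbit_l_values = [1.1, 1.8, 2.2, 3.0, 4.2, 6.0]
rates = [fault_rate(l) for l in orbit_l_values]
```

Evaluating this lookup along the propagated orbit is what yields time-varying fault-rate traces like those in Figures 3-7 and 3-10.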

Figure 3-4. Phased-mission Markov model transitioning between TMR and DWC modes.

Figure 3-5. Markov models of RFT modes. A) 6x HP. B) 6x ABFT. C) 3x DWC. D) 2x TMR.
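The availability and performability figures derived from models like those in Figure 3-5 come from solving small continuous-time Markov chains. The following is a minimal sketch, solving a three-state chain loosely resembling one TMR instance of Figure 3-5(D): fully operational, one module failed (fault still masked), and failed, with per-module fault rate lam and scrub/repair rate mu. The rates and per-state reward values are illustrative, not the dissertation's parameters:

```python
# Steady-state solution of a small CTMC in pure Python.
# States: 0 = all three TMR modules good, 1 = one module failed (fault
# masked by voting), 2 = system failed.  Rates and rewards illustrative.
def solve(a, b):
    """Solve a small dense linear system a x = b by Gaussian elimination."""
    n = len(b)
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (b[r] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x

def steady_state(lam, mu):
    # Generator matrix Q (row = from-state, column = to-state): scrubbing
    # returns any degraded or failed state to fully operational.
    q = [[-3 * lam, 3 * lam, 0.0],
         [mu, -(mu + 2 * lam), 2 * lam],
         [mu, 0.0, -mu]]
    # Solve pi Q = 0 with normalization sum(pi) = 1: transpose Q and
    # replace one balance equation with the normalization row.
    a = [[q[r][c] for r in range(3)] for c in range(3)]
    a[2] = [1.0, 1.0, 1.0]
    return solve(a, [0.0, 0.0, 1.0])

pi = steady_state(lam=0.01, mu=1.0)
availability = pi[0] + pi[1]       # states 0 and 1 deliver service
reward = [1.0, 0.9, 0.0]           # illustrative per-state reward rates
performability = sum(r * p for r, p in zip(reward, pi))
```

Weighting the state probabilities by reward rates, as in the last line, is the general mechanism by which a performability measure extends plain availability; the actual RFT models add more states, mode transitions, and phased-mission stitching (Figure 3-4).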

Figure 3-6. RFT validation Markov models. A) HP mode. B) TMR mode.

Figure 3-7. LEO fault rates using the RFT fault-rate model. A) Visualization of the EO-1 TLE data using JSatTrak [Gano 2010]. B) Expected fault rates over several orbits.

Figure 3-8. LEO system availability. A) System availability of static strategies. B) System availability of adaptive strategies.

Figure 3-9. Effects of adaptive thresholds on availability and performability.

Figure 3-10. HEO fault rates using the RFT fault-rate model. A) Visualization of Molniya-1 TLE data using JSatTrak [Gano 2010]. B) Expected fault rates over time.

Figure 3-11. HEO system availability. A) System availability of static strategies (zoomed image on right). B) System availability of adaptive strategies (zoomed image on right).

CHAPTER 4
ALGORITHM-BASED FAULT TOLERANCE FOR FPGA SYSTEMS

Algorithm-based fault tolerance (ABFT) is a fault-tolerance method that can be used with many linear-algebra operations, such as matrix multiplication or LU decomposition [Huang and Abraham 1984]. Additionally, the ABFT approach can be extended to other linear operators, such as the Fourier transform. Fortunately, many common space applications are composed of linear-algebra operations; e.g., HSI features matrix multiplication [Jacobs et al. 2008], while SAR features Fast Fourier Transforms. Other algorithms can often be converted to fit an algebraic framework. Traditionally, ABFT has been implemented in software, with multiprocessor arrays, and in hardware, with systolic arrays, to protect application datapaths. Our ABFT approach may be used in FPGA applications to provide both datapath and configuration-memory protection with low overhead. By demonstrating the effectiveness of ABFT in FPGA systems, we can enable the use of FPGAs in future space missions where resource constraints and reliability are the major challenges. In the larger perspective of an RFT system, ABFT enables additional design-space options for optimizing total system performability.

In this chapter, we present an analysis of multiple fault-tolerance methods on Xilinx FPGAs, including TMR and ABFT. We examine the resource usage of each method and measure the vulnerability of each design using a fault-injection tool. We then examine possible design tradeoffs and modifications that can enable higher reliability. The remainder of this chapter is organized as follows. Section 4.1 describes multiple hardware architectures for ABFT matrix multiplication, along with analysis of resource usage and fault vulnerability. Section 4.2 describes an ABFT-based algorithm for FFTs, FFT case-study architectures, and fault-vulnerability analysis.
Section 4.3 presents conclusions, provides suggestions for developing more reliable designs, and outlines directions for future research.

4.1 Matrix Multiplication

Matrix multiplication (MM) is used as a key kernel in a large number of signal-processing applications, and has been shown to benefit from the performance of FPGAs [Dave et al. 2007; Wu et al. 2010; Zhuo and Prasanna 2004]. The traditional formulation of ABFT uses MM as a motivating case, demonstrating the method's ability to detect and correct errors while limiting computational complexity and overhead. MM also has the benefit of simple parallel decomposition strategies, trading additional FPGA resource usage for decreased total execution time, which work well with ABFT. In Section 4.1.1, the mathematical description of ABFT for MM is presented. In Section 4.1.2, several possible MM-ABFT architectures are described. Section 4.1.3 analyzes the resource-overhead tradeoffs for each of the proposed architectures. Section 4.1.4 presents an analysis of fault-injection results obtained from each of the proposed architectures. Finally, Section 4.1.5 summarizes the results from the matrix-multiplication analysis.

4.1.1 Checksum-Based ABFT for Matrix Multiplication

The following definitions provide the mathematical background for ABFT. To obtain the weighted checksums, the initial data must be multiplied by an encoder matrix. Without loss of generality, and to simplify the notation, we assume that a generic data matrix is square with dimensions N x N.

Definition 1: An encoder matrix is a matrix whose product with the data matrix yields the desired checksums. For the remainder of this chapter we will refer to the encoder matrix as E_N. Depending on the size of the encoding matrix, E_N may encode more than one checksum row or column. The E_N used in this chapter has dimensions N x 1:

E_N = [ 1  1  ...  1 ]^T      (4-1)

Definition 2: A column checksum matrix A_C is an initial data matrix A that has been augmented with an extra row of checksums. Such a matrix has dimensions (N+1) x N and the form:

A_C = [    A    ]      (4-2)
      [ E_N^T A ]

Similarly, a row checksum matrix A_R can be obtained by augmenting a data matrix A with an additional column. Such a matrix has dimensions N x (N+1) and the following form:

A_R = [ A   A E_N ]      (4-3)

Definition 3: The product of a column checksum matrix A_C and a row checksum matrix B_R produces a full checksum matrix C_F. Such a matrix has dimensions (N+1) x (N+1) and the form:

A_C B_R = [    A    ] [ B   B E_N ]
          [ E_N^T A ]

        = [    A B         A B E_N     ]
          [ E_N^T A B   E_N^T A B E_N  ]

        = [    C         C E_N     ]      (4-4)
          [ E_N^T C   E_N^T C E_N  ]

        = C_F

The associative property of the matrix product allows for verification of the multiplication procedure by simply recalculating the checksums and comparing them with the ones obtained through the matrix multiplication. In general, operations that preserve

weighted checksums are called checksum-preserving, and the matrix product is an example of such a function.

4.1.2 Matrix-Multiplication Architectures

Several matrix-multiplication modules were created to examine the reliability of hardware-based ABFT. This section discusses a serial architecture and two possible parallelization strategies for MM, along with the design decisions that were made for the ABFT architectures. For this analysis, 32-bit integer precision was used in each design.

4.1.2.1 Baseline, serial architecture

The minimal-hardware, serial architecture for the matrix-multiplication function consists of a single multiply-accumulator (MAC), an address-generation module, and three data-storage modules (RAM), as shown in Figure 4-1(a). Two memories are used to store the input matrices A and B, and one memory is used for the resulting output matrix C. The address generator iterates through the correct matrix indices (i, j, k in Figure 4-2), sending data stored in the two input RAMs to the MAC, and generates the appropriate address for output values. This MM module can be used for any matrix size, the limiting factor being data storage. For this analysis, the input and output matrices are each stored in a 1024-deep, 32-bit Xilinx BlockRAM (BRAM) component. This architecture requires O(N^3) cycles to fully calculate an output matrix. MM computational throughput can be improved by exploiting parallelism with additional MAC units.

4.1.2.2 Fine-grained parallel architecture

The fine-grained parallel MM architecture unrolls the inner loop of the MM algorithm shown in Figure 4-2. Each element in the result matrix C is the dot product of a row from matrix A and a column from matrix B. This parallel architecture uses multiple processing elements to compute the dot product in parallel.
With the fine-grained parallel architecture shown in Figure 4-1(b), the outputs of several multipliers are connected to an adder-tree structure, allowing the parallel computation of partial dot products, which are then accumulated into the final, full dot product. By fully

parallelizing the dot product (using N multipliers), the execution time of the full algorithm can be reduced from O(N^3) to O(N^2). This method requires accessing multiple memory elements in parallel, and may be limited by the total number of memory elements or DSP components on the FPGA.

4.1.2.3 Coarse-grained parallel architecture

The coarse-grained parallel approach essentially unrolls the outer loop of the MM algorithm in Figure 4-2. Each row in the result matrix is calculated from a row in matrix A and every element of matrix B. This data parallelism enables each processing element to calculate a unique portion of matrix C by using a portion of matrix A and all of matrix B, as shown in Figure 4-1(c). When each processing element computes a single row, the execution time of the full algorithm can be reduced from O(N^3) to O(N^2). This method requires reading multiple values from matrix A and writing multiple values to matrix C in parallel, and requires the same total number of memory and DSP elements as the fine-grained parallel architecture.

4.1.2.4 Architectural modifications for ABFT

The addition of ABFT logic requires the creation of two functions: ABFT checksum generation and ABFT checksum verification. Each of these functions requires a simple accumulator (or a MAC for weighted checksums). In order to accommodate the calculated checksum data, the BRAM storage must be large enough to hold the ABFT-augmented matrices. Checksum generation sums each column of matrix A and writes the checksums into the matrix A BRAM, creating the column checksum matrix A_C. Next, checksum generation performs the same process for the rows of matrix B to create the row checksum matrix B_R. The checksum-verification function sums the columns and rows of matrix C and compares the sums to the checksum values in the matrix C BRAM. If a mismatch is detected, an error signal is asserted until the module is reset.
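The generation and verification steps above, together with the single-element correction described later in this section, can be sketched numerically. This is an illustration of the checksum algebra, not the hardware datapath, and it assumes the unweighted all-ones encoder (so checksums are plain sums, matching the column/row-summing behavior of the checksum-generation hardware):

```python
# Numerical sketch of ABFT checksum generation, verification, and
# single-element correction for matrix multiplication, assuming the
# unweighted all-ones encoder matrix.  Pure-software illustration.

def with_col_checksum(a):            # A -> A_C: append a row of column sums
    return a + [[sum(row[j] for row in a) for j in range(len(a[0]))]]

def with_row_checksum(b):            # B -> B_R: append a column of row sums
    return [row + [sum(row)] for row in b]

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def verify(cf, tolerance=0):
    """Return (bad_rows, bad_cols) whose recomputed sums mismatch the
    stored checksums by more than `tolerance` (zero for integer data)."""
    n = len(cf) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(cf[i][:n]) - cf[i][n]) > tolerance]
    bad_cols = [j for j in range(n)
                if abs(sum(cf[i][j] for i in range(n)) - cf[n][j]) > tolerance]
    return bad_rows, bad_cols

def correct_single(cf):
    """Repair one faulty element at the intersection of the (single)
    faulty row and column, using the column checksum of C."""
    (r,), (c,) = verify(cf)
    n = len(cf) - 1
    cf[r][c] = cf[n][c] - sum(cf[i][c] for i in range(n) if i != r)

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
cf = matmul(with_col_checksum(a), with_row_checksum(b))  # full checksum C_F
assert verify(cf) == ([], [])        # a clean product passes verification
cf[0][1] += 9                        # inject a single-element error
correct_single(cf)                   # locate and repair it
assert verify(cf) == ([], [])
```

The nonzero `tolerance` path corresponds to the user-defined threshold discussed below for applications that can tolerate low-significance errors.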
For some applications, such as image processing, data errors that occur in low-significance bits may be ignored. ABFT accomplishes this by comparing the

difference of two generated checksums to a user-defined threshold value. However, for maximum coverage, this threshold should be set to zero for integer operations. Figure 4-3 shows an example of an ABFT-enabled MM architecture where an additional MAC is used for the checksum-generation and checksum-verification functions. The MAC hardware that exists for the main MM operation may be reused for creating checksums (ABFT-Shared), or an additional accumulator can be used for this purpose (ABFT-Extra). For the baseline architecture, an additional ABFT accumulator would incur almost 100% overhead. However, when the ABFT encoding matrix in Equation 4-1 is used, multipliers are not required and the ABFT hardware can be simplified. For parallel designs with multiple processing elements, the overhead of a single MAC for ABFT calculations is amortized.

To implement ABFT error correction, the indices of faulty rows and columns can be temporarily stored in registers. Faulty elements exist at the intersection of a faulty row and a faulty column. To correct a faulty element, the checksum-generation module must recalculate the column checksum, ignoring the value at the faulty row index. This sum is then subtracted from the matrix C checksum value to obtain the correct value, which is stored in the matrix C RAM. Using the encoding matrix defined in Equation 4-1, ABFT is able to detect single- and multiple-element errors. However, for specific distributions of multiple-element errors, the correction algorithm may fail. Weighted-checksum encoding matrices can be used to provide additional fault-localization capabilities for multiple-element errors. The ABFT designs discussed in later sections perform error detection only.

4.1.3 Resource-Overhead Experiments

In this section we analyze the overhead of the architectures presented in Section 4.1.2 and compare them to traditional fault-tolerance mitigation strategies.
For this analysis, we use a 32-bit integer MM module and vary the number of processing elements and the type of parallelization. These MM modules can perform computation

on matrices up to 32 x 32 elements in size, limited only by the amount of BRAM dedicated to storage. We compare fine-grained and coarse-grained parallel MM designs with several fault-tolerant designs (TMR, ABFT, and hybrid TMR/ABFT). The baseline and ABFT designs were synthesized using the Xilinx synthesis tool, while the TMR designs used Synplify Premier. Each design was implemented on a Xilinx ML605 development board with a Virtex6-LX240T FPGA. The results of this comparison, with one processing element per design, are shown in Table 4-1 and discussed below.

4.1.3.1 Resource overhead of serial architectures

The baseline (non-fault-tolerant) MM architecture with a single processing element uses 3 BRAMs to store input and output data. Matrices A and B are each stored in separate BRAM components in order to allow simultaneous accesses, while matrix C only requires a single BRAM. DSP48 components are reserved for the multiply-accumulate unit, which can be implemented in 3 DSP48 units per 32-bit MAC. All remaining addressing and indexing logic (counters, state machines, etc.) is implemented using slices (LUTs and FFs). The ABFT-Shared design does not require any additional BRAM or DSP48 units over the baseline design. The additional logic needed to handle addressing matrices during checksum generation and verification increases the number of required slices by 14%. The ABFT-Extra design uses 100% more DSP48s and 32% more slices for the additional MAC; no additional BRAMs are needed.

For comparison, Table 4-1 also shows the resource usage of a TMR design. The serial TMR design has 146% overhead as compared to the baseline design. This TMR design was created using the high-reliability features of the Synplify Premier synthesis tool, which inserts low-level TMR voting into the ABFT-MM core's netlist. Alternative methods for creating TMR designs are available from Xilinx [Xilinx 2004], BYU-LANL [Pratt et al.
2006], Mentor Graphics [Mentor Graphics 2013], and others. As expected, 200% more DSPs and BRAMs are required for TMR. However, slice usage in

the TMR design increased less than expected. This difference may be caused by optimization during the Xilinx mapping and place-and-route processes. Since a Xilinx logic slice has many internal components (multiple lookup tables), additional logic may be packed into partially used slices, reducing the need for additional slices. We also examine a hybrid design, based on the ABFT-Extra design, which uses TMR on the address generator and all state machines within the design, but only uses ABFT along the datapath. This hybrid approach results in a fine-grained design that has approximately 156% overhead for slices but only 100% overhead on the limited DSP48 resources.

4.1.3.2 Resource overhead of parallel architectures

As additional processing elements are added to the existing serial MM designs, the resources required for the datapath should increase much more than those for the control path. This asymmetric scaling makes resource overhead dependent on the amount of parallelism. Figure 4-4 and Figure 4-5 compare the slice overhead of each of the designs discussed in Section 4.1.2 while varying the number of processing elements from 1 to 32, which is a fully parallel implementation for the MM. Figure 4-6 compares the DSP48 resource overhead for each of the designs. For each parallel design, both coarse-grained and fine-grained parallelizations are examined. The overhead of each fault-tolerant design is calculated relative to the resource usage of the equivalent non-fault-tolerant design.

In the serial designs, the overhead of both fine-grained and coarse-grained ABFT-Extra designs is higher than that of the ABFT-Shared designs. However, the ABFT-Extra slice overhead decreases for highly parallel designs, while the ABFT-Shared designs show the opposite trend. For modules with a large number of processing elements, the ABFT-Shared control logic can require significantly more slices than the baseline, unprotected design. The overhead in the ABFT-Shared designs is attributed

to additional multiplexers needed to correctly share access to the design's processing elements. Each of the explored ABFT designs has less than 60% slice overhead.

The overhead of the TMR designs ranges from 65% to 150% additional slices, significantly lower than the expected 200% overhead. In the fine-grained designs, relative overhead was smaller in the highly parallel implementations; for the coarse-grained designs, however, the overhead increased. The TMR design has a higher overhead (200%+) of LUTs; however, each slice is composed of multiple LUTs, and a slice is considered used if any of its LUTs are used. The TMR design can therefore have a lower slice overhead when the design can be more efficiently packed into slices by using more LUTs per slice. The fine-grained architecture is able to take advantage of this efficient packing, while the coarse-grained architectures cannot. For large designs (e.g., 32 PEs), the number of fully used slices in the original design increases, raising TMR overhead by requiring additional slices for the TMR logic. The slice overhead of the hybrid ABFT/TMR design is comparable to that of the TMR design: designs with many processing elements have less resource overhead than smaller designs. The fine-grained design's slice overhead scales much more favorably than the coarse-grained design's, which is correlated with the slice overhead of the TMR design. This slice overhead could be reduced by more selectively choosing which portions of logic to protect with TMR.

The limiting resources for the MM designs are the DSP48 and BRAM components. Figure 4-6 shows the overhead of DSP48 and BRAM components for the various designs. The results are identical for the fine-grained and coarse-grained parallelizations. As more processing elements are used in the ABFT-Extra MM module, the DSP48 overhead created by the additional ABFT MAC unit becomes extremely low (3.125% with 32 processing elements). The ABFT-Shared design requires zero additional DSP48 units.
Additionally, neither the ABFT-Shared nor the ABFT-Extra method requires additional BRAMs. The hybrid design has the same DSP48 overhead as the

ABFT-Extra design because only control logic is replicated. The DSP48 and BRAM overhead for the TMR design was expected to be 200%, independent of the amount of parallelism. However, BRAM usage actually increased by less than 100% for non-serial designs due to the underlying FPGA architecture, the design size, and optimizations performed by the Synopsys synthesizer. The Virtex-6 BRAM can be used as one 1024-element RAM or as two 512-element RAMs. The Xilinx tool will only infer the larger, 1024-element RAMs, while Synopsys can more efficiently use the smaller 512-element RAMs. Therefore, the 200% overhead is mitigated by the need for half as many large BRAMs, and the BRAM overhead will asymptotically approach 50% with more processing elements.

From a resource-overhead perspective, the ABFT designs have the desired properties of a low-overhead fault-tolerance method. All of the examined ABFT designs exhibited a slice overhead of less than 60%, with extremely low DSP48 resource overhead. The hybrid ABFT/TMR design should provide additional fault tolerance while not requiring replication of the limited DSP48 resources on the FPGA. In the next section, fault-injection testing determines the effectiveness of these designs.

4.1.4 Fault-Injection Experiments

While Section 4.1.3 has shown that ABFT provides lower overhead than other fault-tolerance strategies, the reliability of ABFT must also be evaluated. In order to validate our ABFT design, faults must be injected into an executing system. In this section, we present FPGA fault-injection results gathered using the Simple, Portable Fault Injector (SPFI) [Cieslewski et al. 2010]. SPFI performs fault injection by modifying FPGA configuration frames within a design's bitstream, re-programming the FPGA, and comparing the resulting output against known values. Partial reconfiguration is used to reduce the time required to modify configuration memory and to improve the speed of fault injection.
However, due to the length of time required to exhaustively test an entire

FPGA design, statistical sampling is used to estimate the total number of vulnerable bits in a given design. We implemented multiple fault-tolerant designs discussed in Section 4.1.2 on a Xilinx ML605 FPGA development platform. Each design used a 32-bit integer MM module with up to 32 processing elements. A UART was also implemented on the FPGA to stream input test vectors to the MM module and to report results back to a verification program. During the design phase, the MM module is constrained to a small portion of the FPGA. The SPFI fault-injection tool enables targeted injections, allowing the UART to be avoided during fault injection.

Table 4-2 shows the fault-injection results for each of the fault-tolerant MM designs from Section 4.1.2. For each design, the table indicates the number of injections performed, the number of undetected data errors, and the number of system hangs detected. The measured percentage of faults is then scaled to the size of the injection area to estimate the total number of vulnerable bits. In this analysis, vulnerable bits are the total number of bits which cause silent (undetected) data errors or system hangs. For ABFT designs, bits resulting in false-positive results were not considered vulnerable, since they do not allow faulty data to be propagated back to the host system. The fault vulnerability for each component is divided by the FPGA's total number of configuration bits to estimate the component's design vulnerability factor (DVF). A design's configuration DVF represents the percentage of configuration bits that are vulnerable to faults and can result in errors. The total DVF of a component includes vulnerable configuration bits and BRAM data bits which may affect computation results. The reliability of a given design can then be calculated from the FPGA's total fault rate scaled by the design's DVF. Most Xilinx FPGA designs have a DVF that ranges from 1% to 10% [Xilinx 2010b] due to the large amount of configuration memory devoted to routing.
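Because exhaustive injection is impractical, the vulnerable-bit and DVF figures of Table 4-2 are statistical estimates of the kind sketched below. The sample counts and bit totals here are hypothetical, and the confidence interval is a standard normal approximation added for illustration, not necessarily what SPFI reports:

```python
# Sketch of estimating vulnerable bits and the design vulnerability
# factor (DVF) from sampled fault injection.  All inputs hypothetical.
import math

def estimate_dvf(injections, errors_observed, region_bits, device_bits):
    """Scale the observed error fraction to the injection region, then
    normalize by the device's total configuration bits."""
    p = errors_observed / injections        # sampled vulnerability fraction
    vulnerable_bits = p * region_bits       # scaled to the injection area
    dvf = vulnerable_bits / device_bits     # fraction of the whole device
    # 95% normal-approximation confidence half-width on p (illustrative).
    half_width = 1.96 * math.sqrt(p * (1 - p) / injections)
    return vulnerable_bits, dvf, half_width

# Hypothetical design: 100,000 injections into a 2,000,000-bit region of
# a device with roughly 73 million configuration bits, observing 740
# errors (silent data errors plus hangs).
vuln, dvf, hw = estimate_dvf(100_000, 740, 2_000_000, 73_000_000)
```

Multiplying the resulting DVF by the device's environment-dependent upset rate then yields the design-level error rate used in the reliability models of Chapter 3.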

4.1.4.1 Design vulnerability of serial architectures

The non-fault-tolerant baseline MM design has an estimated 14,802 vulnerable configuration bits. The majority of these vulnerable bits result in undetected data errors. Other bits cause the system to become unresponsive, or hang. These system hangs may be caused by errors in the MM state-machine logic, or by errors in FPGA routing logic which prevent connections from the MM module to the communication UART. The baseline design is small and uses less than 0.5% of the total logic slices available on the FPGA. However, the configuration DVF of the baseline design is 0.026%. When vulnerable data stored in BRAMs is also included, the total DVF of the baseline design is 0.198%.

The ABFT-Shared design has an estimated 8,756 vulnerable bits (approximately 60% of the unprotected design) and a DVF of 0.016%. The vulnerable bits can affect address generation, checksum generation or verification, or the error-detection status register. BRAM data bits are initially unprotected, but are effectively safe once the ABFT checksum-generation process has taken place. If the input BRAMs, prior to the checksum calculation, are considered vulnerable, the ABFT-Shared design has a DVF of 0.127%.

In the ABFT-Extra design, the independent MAC unit for checksum generation and verification was expected to isolate faults in the main datapath from the ABFT checksum calculations, leading to a more reliable design. However, in the comparison of serial designs, the vulnerabilities of the two ABFT designs were similar, with the ABFT-Shared design having 5% fewer vulnerable bits than the ABFT-Extra design. Both ABFT designs experience a similar number of data errors, but fewer faults cause the ABFT-Extra design to hang.

The TMR design has an estimated 1,390 vulnerable bits and a total DVF of 0.002%. The majority of these bits are the result of routing faults or TMR majority-voter faults.
This result represents a realistic lower bound on the total vulnerability of the MM design. The TMR reliability results in Table 4-2 were obtained by using the Synplify Premier

high-reliability tool, which resulted in significantly more reliable designs than a naive VHDL-level TMR approach, since the tool performs TMR with finer granularity and more frequent voting. The hybrid ABFT design has approximately 30% lower vulnerability than the ABFT-Extra and ABFT-Shared designs, but higher vulnerability than the TMR design. However, the hybrid design has lower resource usage than the TMR design (see Table 4-1). For applications where BRAM or DSP48 resources are constrained, the hybrid ABFT design provides a good compromise between low vulnerability (0.012% configuration DVF) and low DSP48 usage (100% overhead).

4.1.4.2 Design vulnerability of parallel architectures

Although the serial implementations of the ABFT designs demonstrated a modest reduction in vulnerability, the benefits of ABFT are expected to be more visible in parallel implementations. With multiple processing elements, faulty elements will only affect a subset of the output data, improving fault localization. Figures 4-7 and 4-8 show the number of vulnerable bits for each design while varying the number of processing elements. In each of the baseline designs, the number of vulnerable bits increases linearly with the amount of parallelization. Intuitively, as the total area required for the design increases, the number of vulnerable bits increases proportionally. The coarse-grained design has a higher vulnerability per processing element than the fine-grained design due to its larger resource requirements.

As more processing elements are used, the vulnerability of the ABFT designs increases, up to the 16-PE design. The rate of increase is much slower than that of the baseline designs, and therefore the improvement in reliability is positively correlated with the amount of parallelism employed. Additionally, in the fully parallel 32-PE designs, the ABFT designs exhibit their highest reliability.
When fully parallelized, the control logic for the MM calculation is significantly simplified, resulting in fewer system hangs. For each

of the parallel implementations, the ABFT-Shared design was more vulnerable than the ABFT-Extra design, with 20%-50% more vulnerable bits. Meanwhile, there was not a significant difference in vulnerability between the coarse-grained and fine-grained ABFT designs: any benefits from fault localization in the coarse-grained design were offset by its increased resource requirements. As expected, the TMR design has significantly fewer vulnerable bits than any of the other designs; additionally, as the number of processing elements scales up, its number of vulnerable bits stays constant. Finally, the hybrid ABFT design has a fault vulnerability similar to that of the ABFT-Extra design upon which it was based. The fine-grained hybrid design has lower vulnerability for implementations with up to 4 processing elements; for larger designs, the vulnerabilities are similar. For the coarse-grained designs, the hybrid design has 5%-15% fewer vulnerable bits than the ABFT-Extra design. In both cases, the numbers of false positives and system hangs are up to 70% lower in the hybrid design.

4.1.5 Analysis of Matrix-Multiplication Architectures

The DVFs of the serial matrix-multiplication designs tested in this section are very low, due to their small resource requirements in comparison to the size of the FPGA being used. However, the effectiveness of each design is readily measurable. While the TMR design is the most reliable design, the serial hybrid ABFT design does reduce the number of vulnerable bits by 56% over the baseline design. For the 32-MAC parallel design, the fine-grained hybrid ABFT design has 98% fewer vulnerable bits than the baseline design, while only incurring 25% slice overhead, 3.125% DSP48 overhead, and 0% BRAM overhead.

The observed errors were categorized into three types: silent data errors, system hangs, and false positives.
While silent data errors must be reduced as much as possible, system hangs and false positives are correctable within a system-level reliability framework. Each of the tested designs, including the TMR design, experienced

system hangs which prevented the system from returning data. In order to prevent system hangs, an external watchdog timer may be employed to reset the system, reconfigure the FPGA, and resume processing. Additionally, each of the ABFT designs experienced false-positive results. A false positive cannot be identified as such at run-time, so the error flag will trigger the recovery procedure, causing an FPGA reconfiguration and re-computation of the last data set. By reducing false positives, the performance overhead incurred by recomputing data can be reduced. The hybrid ABFT design experienced up to 70% fewer false positives than the ABFT-Shared or ABFT-Extra designs. Additionally, an examination of result matrices with data errors revealed that many faulty matrices contained all zero values, producing an incorrect result with a valid (zero) checksum. It may be possible to further improve the reliability of the ABFT designs by modifying the ABFT encoding matrix to ensure that output matrices cannot contain all zeros.

4.2 Fast Fourier Transform

The Fast Fourier Transform (FFT) is another key kernel in many space-based applications, such as synthetic-aperture radar and beamforming. While the FFT can be computed using traditional general-purpose processors or DSPs, the FFT has an efficient, high-throughput FPGA implementation. Unlike matrix multiplication, the FFT does not naturally fit into the traditional ABFT framework. In Section 4.2.1, the mathematical description of our block-based ABFT approach for the FFT is presented. In Section 4.2.2, multiple FFT architectures are discussed for use within a generic ABFT-enabled FFT architecture. Section 4.2.3 analyzes the resource-overhead tradeoffs for each of the proposed architectures. Section 4.2.4 presents an analysis of fault-injection results obtained from each of the proposed architectures. Finally, Section 4.2.5 summarizes the results from the FFT analysis.

4.2.1 Checksum-Based ABFT for FFTs

Since the Fourier Transform is a linear operator, the properties of linearity can be used to show that ABFT techniques can be applied to the Fast Fourier Transform. For two vectors a and b, the following equality holds for all linear operators (including the FFT):

F(a + b) = F(a) + F(b)  (4-5)

Using this property we can create a block-based ABFT technique that can be used to detect errors in blocks of FFTs. Instead of performing detection on each input vector, we form a matrix A composed of individual signal vectors (x_0, x_1, x_2, ...). Following the approach in Section 4.1, we can then augment this matrix with an additional checksum row by using the same encoding vector E_N as in Equation 4-1:

A = [x_0; x_1; ...; x_{N-2}; x_{N-1}]  (4-6)

A_C = [A; E_N^T A],  where E_N^T A = Σ_{i=0}^{N-1} x_i  (4-7)

By performing an FFT on each row of the augmented matrix A_C, we produce an output matrix B_C which contains the transformed values of each input vector and

the transformed checksum vector. The transformed checksum vector will be a valid checksum for the output matrix, as shown in Equation 4-8:

F(A_C) = [F(x_0); ...; F(x_{N-1}); F(Σ_{i=0}^{N-1} x_i)]  (4-8)

B_C = [X_0; ...; X_{N-1}; Σ_{i=0}^{N-1} X_i]  (4-9)

Using the property of linearity described in Equation 4-5, the checksum rows of F(A_C) and B_C are equivalent. In the event of an error during computation, the error will be detectable, although more complex encoding vectors are required to detect which vector is faulty. The absolute overhead of using this ABFT technique is dependent on the number of vectors used per ABFT block. Given N vectors of M elements, the additional overhead of checksum generation and verification is O(NM), while the additional core computation is O(M log(M)). As a function of ABFT block size, the relative overhead of ABFT is O(1/N).

4.2.2 FFT Architectures

Multiple FFT modules were created to examine the reliability of hardware-based ABFT. Unlike the previous matrix-multiplication architectures, this analysis uses pre-built, single-precision floating-point operators to construct each design. Commonly, pre-existing IP is used to reduce development costs by improving design and verification time. The FFT operators and floating-point arithmetic operators used in the following sections are pre-existing IP generated using the Xilinx CoreGen tool [Xilinx 2013a]. The

Xilinx FFT operator has several possible architectures, including a high-performance pipelined architecture and a minimal-hardware serial architecture, which will be compared in this section. The ABFT logic for each design was generated by combining these pre-existing IP cores to perform the checksum operations. This section discusses the trade-offs of each FFT architecture and the design decisions that were made for the ABFT architecture.

Radix-2 Burst-IO FFT architecture

The basic, serial architecture for the Fast Fourier Transform function consists of a single butterfly operator (complex multiplication and addition), an address-generation module, and two data-storage modules (BRAM), as shown in Figure 4-9(a). One memory is used to store the input matrix A, and one memory is used for the resulting output matrix B. The address generator iterates through the correct matrix indices, sending data stored in the input BRAMs to the FFT operator, and generates the appropriate address for output values. This FFT module can be used for any matrix size as long as enough data storage is supplied. For an M-point FFT, this architecture requires O(M log(M)) cycles to calculate an output vector.

Radix-2 Pipelined FFT architecture

The pipelined FFT architecture has the same high-level architectural diagram as in Figure 4-9(a). However, internally it uses log(M) butterfly operators to enable high throughput. The latency of the pipelined architecture is approximately O(2M) cycles, and the time to compute N vectors is O(NM) cycles, resulting in a significant improvement in processing speed while calculating multiple, consecutive FFTs.

Architectural modifications for ABFT

The addition of ABFT logic requires the creation of two functions, ABFT checksum generation and ABFT checksum verification, each requiring a floating-point accumulator.
Unlike the matrix-multiplication case study, the ABFT checksum logic is significantly different from the algorithm's main computation, and the logic must be implemented

separately. Checksum generation sums each column of matrix A using a floating-point accumulator and writes the checksum into a checksum BRAM. After the FFT operation, the checksum-verification function sums the columns of matrix B and writes the value to a checksum BRAM. Then, the checksum-verification function uses a floating-point comparison operator to compare the checksum values in the checksum BRAMs. If a mismatch is detected, an error-found signal is asserted until the module is reset. In order to reduce the BRAM requirements, the last row of the matrix A and B BRAMs may be reserved for checksums. In this case, the initially generated checksum will be written to the matrix A BRAM, and the verification checksum will overwrite that checksum value after computation is complete. Figure 4-9(b) shows an example of an ABFT-enabled FFT architecture where checksums are stored within the matrix A and B BRAMs. Due to the nature of floating-point arithmetic, the calculated checksums may not be exact. Quantization error is introduced due to the order of additions as well as the internal rounding within the FFT module. For most applications using FFTs, such as image processing, data errors that occur in low-significance bits may be safely ignored, and errors that avoid detection should not significantly impact the higher-level application. In order to handle errors in the high-significance bits, ABFT detects errors by comparing the difference of the two generated checksums to a user-defined threshold value. For maximum coverage, this threshold should be set as close to zero as possible. Unfortunately, determination of the threshold value is both algorithm-specific and data-specific. In order to estimate the required threshold values while calculating blocks of FFTs, MATLAB simulations were used to measure the maximum quantization error encountered while using various input data sets.
For each simulation, test vectors were generated using a uniform random distribution on the interval (−x/2, x/2), where x is the range of the input data. Figure 4-10 shows the required threshold to avoid false positives for the block-based ABFT FFT while varying the range of the input data. The observed linear trend simplifies the selection of an acceptable threshold value.
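The block-checksum scheme of Equations 4-5 through 4-9, combined with the threshold comparison just described, can be sketched in software. This is a behavioral model only, not the BRAM-based hardware; the function names and the small recursive FFT are illustrative:

```python
import cmath

def fft(x):
    # Recursive radix-2 FFT; len(x) must be a power of two.
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def transform_block(vectors):
    # Append the checksum row (all-ones encoding vector), then FFT every row.
    checksum = [sum(col) for col in zip(*vectors)]
    return [fft(v) for v in vectors], fft(checksum)

def verify_block(outputs, checksum_out, threshold):
    # Re-sum the transformed rows; by linearity (Equation 4-5) this must
    # match the transformed checksum to within the quantization threshold.
    resum = [sum(col) for col in zip(*outputs)]
    return max(abs(a - b) for a, b in zip(resum, checksum_out)) <= threshold
```

With a block of N vectors, only one extra M-point FFT plus the column sums are added, consistent with the O(1/N) relative overhead noted in Section 4.2.1; an error injected into any transformed row perturbs the re-summed checksum and trips the threshold test.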

4.2.3 Resource-Overhead Experiments

In this section we analyze the overhead of the ABFT architectures presented in Section 4.2.2 and compare them to traditional fault-tolerance mitigation strategies. For this analysis, 64-point FFTs will be used, enabling up to 16 complex input vectors to fit within two BRAM components. The floating-point Burst-IO and Pipelined FFT modules are generated using the Xilinx CoreGen utility. We compare the two baseline FFT architectures with several fault-tolerant designs (TMR, ABFT, and hybrid TMR/ABFT). The results of this comparison are shown in Table 4-3.

Resource overhead of Burst-IO architecture

In the baseline Burst-IO architecture, 4 BRAMs are used to store input and output data. The FFT module uses an additional 4 BRAMs to hold intermediate values during processing. Additionally, 8 DSP48 units are used for the FFT butterfly computation. The ABFT design requires 4 additional DSP48 units and 260 slices for the floating-point addition, floating-point comparison, and control-logic functionality, representing 42% slice overhead and 50% DSP overhead. Unlike the MM case study, the FFT TMR design has approximately 200% overhead in each of the slice, BRAM, and DSP resources. Since the FFT components are assembled from pre-built netlists produced by CoreGen, little optimization is performed during synthesis, resulting in the expected amount of resource overhead. In the hybrid FFT design, each of the state machines controlling FFT or ABFT components in the design is protected using TMR, while the FFT and ABFT floating-point operators remain unchanged. The hybrid design has 48% slice overhead and 50% DSP overhead. For larger FFTs (e.g., 128+ points), the Burst-IO DSP48 resource requirements remain constant while additional BRAMs are required for intermediate-result storage. The 50% resource overhead of the ABFT design is independent of the parameters of the FFT operator.

Resource overhead of Pipelined architecture

In the baseline Pipelined architecture, 4 BRAMs are used to store input and output data. The FFT module uses 16 DSP48 units for the FFT butterfly computation and one additional BRAM to hold intermediate values during processing. The Pipelined FFT uses almost twice as many slices and DSP48s, but fewer BRAMs, than the Burst-IO version. The ABFT design increases the resource requirements by the same amount as the Burst-IO design (i.e., 4 DSP48s, 260 slices). Since the Pipelined design is approximately twice as large as the Burst-IO design, the overhead is correspondingly lower, at 23% slice overhead and 25% DSP overhead. The TMR design has approximately 200% overhead in each of the slice, BRAM, and DSP resources. The Pipelined hybrid design uses the same methodology as the Burst-IO hybrid design and has only 26% slice overhead and 25% DSP overhead. For larger FFTs (e.g., 128+ points), the size of the Pipelined design increases because more pipeline stages and butterfly operators are required. As the size of the FFT increases, the ABFT requirements remain static, and the ABFT resource overhead is reduced. For example, a 1024-point FFT would require 34 DSP48 components, and the 4 DSP48 components for the ABFT checksums would add only 12% overhead to the baseline design.

4.2.4 FFT Fault Injection

In order to validate our ABFT FFT designs, faults must be injected into an executing system. In this section, we present FPGA fault-injection results gathered using the methodology described in Section 4.1.4. For the ABFT designs, the value for the checksum-error threshold was calculated from the input test vectors during a fault-free processing run. Table 4-4 shows the fault-injection results for each of the fault-tolerant FFT designs from Section 4.2.2. For each design, the table indicates the number of injections performed, the number of undetected data errors, and the number of system hangs detected.
The measured percentage of faults is then scaled to the size of the injection area to

estimate the total number of vulnerable bits. In this analysis, vulnerable bits are the total number of bits which cause silent (undetected) data errors or system hangs. For ABFT designs, bits resulting in false-positive results were not considered vulnerable, since they do not allow faulty data to be propagated back to the host system. The Burst-IO baseline design has an estimated 48,012 vulnerable configuration bits and a configuration DVF of 0.084%. The total DVF includes an additional 131,072 memory bits in BRAM, increasing the baseline design's total DVF to 0.314%. In the baseline design, 80% of the configuration-memory faults cause data errors while 20% cause system hangs. The Burst-IO ABFT design reduces the total number of vulnerable configuration bits by 80%, to 9,624. Additionally, in the ABFT design, the matrix A BRAM data bits are only vulnerable before the checksum value has been computed and the matrix B BRAM values are always protected, improving the total DVF of the design to 0.125%. The most dangerous type of error, silent data errors, is reduced by 95% over the baseline design. The distribution of error types in the ABFT design is reversed from the baseline design's distribution; 80% of configuration faults result in system hangs while only 20% cause data errors. The hybrid ABFT/TMR design reduces the design vulnerability by an additional 30%, to only 6,662 vulnerable bits. The hybrid design decreases the occurrence of system hangs more than data errors. Finally, the TMR FFT design has 3,278 vulnerable bits, 94% fewer vulnerable bits than the baseline design, while protecting all BRAM data bits. The Pipelined baseline design is much more vulnerable than its Burst-IO counterpart, with 182,192 vulnerable configuration bits (0.320% configuration DVF). The increased vulnerability comes from the larger resource requirements and increased routing complexity of the pipelined architecture.
The Pipelined ABFT design reduces data errors by 97% but only reduces system hangs by 33%. The Pipelined ABFT design's error-type distribution is more skewed than the Burst-IO case; 85% of errors result in system hangs and 15% result in data errors. The Pipelined hybrid ABFT/TMR

design reduces the design vulnerability by an additional 20%, to 22,251 vulnerable configuration bits. The hybrid design decreases the occurrence of system hangs more than data errors. Finally, the Pipelined TMR FFT design has 9,306 vulnerable bits, 95% fewer vulnerable configuration bits than the baseline design, while also protecting all BRAM data bits.

4.2.5 Analysis of FFT Architectures

The FFT architectures examined in this section are significantly more complex than the matrix-multiplication architectures of Section 4.1. Due to algorithm complexity and the use of floating-point operators, the resource requirements are much higher. For example, the baseline Burst-IO design is 300% larger than the baseline MM design. Additionally, the floating-point precision also increased the resource requirements of the ABFT logic. However, even with the smaller Burst-IO architecture, the overhead of the ABFT design was only 50%, significantly better than the TMR alternative. For the Pipelined FFT designs, larger FFTs do not increase the resources required by the ABFT component, further improving resource overhead. In each of the FFT designs, the number of vulnerable bits that cause system hangs is much higher than in the MM case studies. The generated FFT components have substantially more control logic than the MM designs, which can lead to system hangs when upset. In the baseline FFT designs, most vulnerable bits result in corrupt data, but the proportion of vulnerable bits causing system hangs is higher than in the MM case studies. In each of the fault-tolerant designs, more vulnerable bits result in system hangs than corrupt data. While the hybrid design does not significantly reduce data corruption, it does lower the occurrence of system hangs. Although TMR provides the highest-reliability designs, ABFT is effective in reducing undetected data errors.
If a system-level mechanism (e.g., watchdog timer) is used to recover from system hangs, the ABFT designs can approach the reliability of TMR.
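The vulnerable-bit estimates above follow a simple scaling: the fraction of injected faults that produced a silent data error or a hang is multiplied by the size of the injection area. A minimal sketch of that bookkeeping (the counts in the example are hypothetical, not the measured values from Table 4-4):

```python
def estimate_vulnerable_bits(injections, data_errors, hangs, injection_area_bits):
    # Faults causing silent data errors or system hangs count as vulnerable;
    # false positives do not, since they trigger recovery rather than
    # releasing bad data to the host.
    return round((data_errors + hangs) / injections * injection_area_bits)

def dvf_percent(vulnerable_bits, total_bits):
    # Design vulnerability factor: vulnerable bits as a percentage of all
    # bits considered (configuration only, or configuration plus BRAM data).
    return 100.0 * vulnerable_bits / total_bits
```

For example, 1,000 vulnerable outcomes out of 100,000 injections sampled over a hypothetical 1,000,000-bit injection area scale to an estimated 10,000 vulnerable bits, a 1% DVF over that area.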

4.3 Conclusions

In this chapter, we have presented a novel analysis of ABFT for low-overhead fault tolerance in FPGA systems. Several matrix-multiplication and FFT designs employing TMR and ABFT fault-tolerance techniques were developed and tested using an FPGA fault-injection tool. The results demonstrated that ABFT was capable of reducing the number of vulnerable configuration bits in a design while also protecting most memory bits. While the TMR fault-mitigation approach had the lowest vulnerability, a hybrid ABFT/TMR design approach was able to lower configuration vulnerability while maintaining low overhead. With matrix multiplication, configuration vulnerability was reduced by 90% in highly-parallel ABFT architectures while using less than 20% slice and DSP48 resource overhead. For FFT architectures, configuration DVF was lowered by 85% while using 50% or less additional resources. As the size and parallelism of each of the ABFT architectures increase, the relative effectiveness of ABFT also increases.

Matrix multiplication and the Fast Fourier Transform are key kernels in many space-based processing systems. The ABFT architectures described in this work are generic and can be applied to many of these systems. Additionally, other linear operations, such as LU or QR decomposition, can also be protected using ABFT. The vulnerability results show that ABFT can be used as part of a comprehensive fault-tolerance strategy for space-based FPGA systems. ABFT detects most data errors that occur during processing. ECC techniques (e.g., Hamming codes, parity) can be used to protect data memory for minimal overhead. A higher-level watchdog mechanism can be used to recover from system hangs and crashes. Many of these techniques, in combination with TMR, are already in use in space systems. By selectively removing replicated components and using ABFT in their place, space systems can increase their processing capabilities without sacrificing reliability.

Table 4-1. Resource utilization and overhead of serial MM designs.

  Fault tolerance   Slice count   Slice overhead   BRAM count   BRAM overhead   DSP48 count   DSP48 overhead
  None              …             –                3            –               3             –
  ABFT-Shared       …             …                3            0%              3             0%
  ABFT-Extra        …             …                3            0%              6             100%
  TMR               …             …                9            200%            9             200%
  Hybrid            …             …                3            0%              6             100%

Table 4-2. Serial matrix multiplication fault-injection results.

  Design name   Faults injected   Data errors   System hangs   Vulnerable config bits (est.)   Configuration DVF (%)   Total DVF (%)
  Baseline      100,000           3,…           …              …                               …                       0.198
  ABFT-Shared   100,000           1,…           …              …                               …                       0.127
  ABFT-Extra    100,000           1,…           …              …                               …                       0.127
  TMR           100,000           …             …              …                               …                       0.002
  Hybrid ABFT   100,000           …             …              …                               0.012                   0.123

Table 4-3. Resource utilization and overhead of FFT designs.

  Architecture   Fault tolerance   Slice count   Slice overhead   BRAM count   BRAM overhead   DSP48 count   DSP48 overhead
  Burst-IO       None              …             –                8            –               8             –
  Burst-IO       ABFT              …             42%              8            0%              12            50%
  Burst-IO       TMR               …             …                …            …               …             …
  Burst-IO       Hybrid            …             48%              8            0%              12            50%
  Pipelined      None              …             –                5            –               16            –
  Pipelined      ABFT              …             23%              5            0%              20            25%
  Pipelined      TMR               …             …                …            …               …             …
  Pipelined      Hybrid            …             26%              5            0%              20            25%

Table 4-4. FFT fault-injection results.

  Design name            Faults injected   Data errors   System hangs   Vulnerable config bits (est.)   Configuration DVF (%)   Total DVF (%)
  Burst-IO - Baseline    100,000           1,…           …              48,012                          0.084                   0.314
  Burst-IO - ABFT        100,000           …             …              9,624                           …                       0.125
  Burst-IO - TMR         100,000           …             …              3,278                           …                       0.006
  Burst-IO - Hybrid      100,000           …             …              6,662                           …                       0.120
  Pipelined - Baseline   100,000           6,891         1,…            182,192                         0.320                   0.550
  Pipelined - ABFT       100,000           …             …              27,…                            …                       0.156
  Pipelined - TMR        100,000           …             …              9,306                           …                       0.016
  Pipelined - Hybrid     100,000           …             …              22,251                          …                       0.147

Figure 4-1. Matrix-multiplication architectures. A) Baseline. B) Fine-grained parallel. C) Coarse-grained parallel.

for (i = 0; i < N; i++)
{
    for (j = 0; j < N; j++)
    {
        C[i][j] = 0;
        for (k = 0; k < N; k++)
        {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Figure 4-2. Matrix-multiplication pseudocode.

Figure 4-3. Matrix-multiplication ABFT-Extra architecture.

Figure 4-4. Slice overhead of fine-grained parallel matrix multiplication.

Figure 4-5. Slice overhead of coarse-grained parallel matrix multiplication.

Figure 4-6. DSP48 and BlockRAM overhead of parallel matrix multiplication.

Figure 4-7. Fault vulnerability of fine-grained parallel matrix multiplication.

Figure 4-8. Fault vulnerability of coarse-grained parallel matrix multiplication.

Figure 4-9. FFT architectures. A) Serial. B) ABFT.

Figure 4-10. Required ABFT threshold value for floating-point FFTs.

CHAPTER 5
RFT SYSTEM INTEGRATION FOR RAPID SYSTEM DEVELOPMENT

The RFT-based system presented in Section 3.1 is intended to be a reference design that can be adapted to actual space-system architectures. In this chapter, we investigate possible methods and tools that will improve the usability of RFT-based systems for system developers, facilitating adoption of our framework. Our approach, shown in Figure 5-1, has identified two areas of RFT system design that may be improved through external tools. The first area, generating RFT hardware for arbitrary system configurations, can be facilitated by creating multiple system templates for various interconnection architectures. Specifically, we target architectural support in order to provide compatibility with commonly used partial-reconfiguration architectures, such as VAPRES [Jara-Berrocal and Gordon-Ross 2010]. The second area, software-based support for intelligent fault-tolerance mode selection, will combine fault-rate modeling and task scheduling to optimize job placement for high performability. Research into each of these areas will facilitate the adoption of the RFT framework into new and existing space systems.

In this section, we outline our approach for improving hardware and software support for RFT systems. Section 5.1 discusses RFT architectural templates, integration of an RFT controller into an existing VAPRES PR-based system, and parameterized VHDL designs for RFT controller logic. RFT controller generation and system integration will reduce the time required to develop RFT systems by using existing designs with little modification and dynamically creating new RFT-related components. Section 5.2 discusses reliable task scheduling, which will reduce the complexity of RFT software requirements.

5.1 Dynamically Generated RFT Components

The RFT controller created for Section 3.1 was designed for a small RFT system consisting of only three PRRs and connected to a host CPU through a bus-based

interconnect. However, for maximum flexibility, the RFT framework must support systems with an arbitrary number of PRRs and interconnect types. Additionally, a system design may decide to support only a subset of the fault-tolerance modes allowed by the RFT framework. Dynamic RFT controller generation allows a system designer to specify the size and supported features of an RFT controller customized for their specific system. By disabling unneeded features, the overhead of the RFT controller can be reduced. The RFT controller's configurable components can be divided into two categories: system-interconnect interfaces and fault-tolerance features. The type of system interconnect affects the structure of the Address Decoder component. The fault-tolerance features affect the Voting Logic, Output Mux, and Watchdog Timer components.

5.1.1 RFT Controller Point-to-Point Interface

The RFT hardware architecture from Section 3.1 was used as an example bus-based PR architecture. However, to improve usability with pre-existing systems, it is possible to adapt the RFT controller for use in other PR architectures. By adapting the RFT controller for use within a point-to-point VAPRES PR system, it will be possible to quickly provide fault tolerance to pre-existing VAPRES systems. Other existing PR architectures can be classified as either bus-based or point-to-point. The initial RFT controller connects to the rest of the system through the Processor Local Bus (PLB). PRMs connected to the RFT controller must act as slave devices. Slave devices can respond to bus requests, but cannot initiate transfers. For example, the processor can read from a PRM's memory, but the PRM cannot write a value directly to the host processor's memory. The RFT controller instantiates a slave PLB controller to handle communication with the host processor. The PLB controller converts the bus signals into a simplified set of user signals.
The primary difference between the system from Section 3.1 and a VAPRES-based system is the use of Fast Simplex Links (FSL) for communication between the

MicroBlaze processor and the PRRs. An FSL template for the RFT controller must be created in order to operate with a point-to-point architecture (VAPRES). An FSL link is a FIFO queue that directly connects the MicroBlaze with user logic. The MicroBlaze instruction set has specific intrinsic functions for reading and writing the FSL links. These direct links from the PRMs to the MicroBlaze increase throughput and reduce latency from bus contention. For an RFT controller operating in high-performance mode, data from an FSL could be passed directly to its connected PRR. However, an FSL-based RFT controller must be able to send data from an arbitrary FSL to any other FSL in order to operate in TMR or DWC modes. For example, in TMR mode, any data written to FSL #0 must also be simultaneously written to FSL #1 and FSL #2 to ensure correct functionality. In order to accomplish this, the FSL-based RFT controller must include an additional bi-directional FIFO queue for each PRR. The RFT controller must process all incoming data, route the data to the appropriate PRM, and write any output data to the correct output FSL. This requires an additional 2 BRAMs per PRR plus the necessary control logic. Table 5-1 shows a comparison of the resource requirements for the bus-based and point-to-point RFT controllers. Since most PR-based systems can be classified as either bus-based or point-to-point, the two RFT controllers created from this work can serve as reference designs. Although alternative buses or protocols may be used in future systems, the overall architecture should remain applicable.

5.1.2 Parameterized and Configurable Voting Logic

While the RFT controller's input- and output-address decoder hardware may be dependent on the interface to the larger PR system, many other components are only dependent on the number of PRRs in the design. These components, created using VHDL, can be parameterized in order to be easily adapted to arbitrarily large systems.
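The TMR fan-out and majority voting described for the FSL-based controller can be modeled behaviorally in a few lines, with deques standing in for the per-PRR FIFO queues and a value-level vote standing in for the hardware voter (a sketch of the routing policy only, not the VHDL implementation; the class and method names are illustrative):

```python
from collections import deque

class FslTmrController:
    """Behavioral model of an FSL-based RFT controller in TMR mode."""

    def __init__(self):
        self.to_prr = [deque(), deque(), deque()]    # controller -> PRR FIFOs
        self.from_prr = [deque(), deque(), deque()]  # PRR -> controller FIFOs

    def host_write(self, word):
        # A word written to FSL #0 is replicated to FSL #1 and FSL #2.
        for q in self.to_prr:
            q.append(word)

    def prr_write(self, prr_id, word):
        # Stand-in for a PRM pushing its result word back to the controller.
        self.from_prr[prr_id].append(word)

    def host_read(self):
        # Majority vote across the three replica outputs (done bitwise in
        # hardware; a value-level vote suffices for this model).
        a, b, c = (q.popleft() for q in self.from_prr)
        return a if (a == b or a == c) else b
```

A single upset replica is outvoted on readback, which is the behavior the RFT controller's additional FIFOs and control logic must preserve across arbitrary FSL pairings.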
The main parameter, NUM_PRRS, scales the RFT controller interfaces to the

appropriate size for the system. The parameterized components we examine are the watchdog timers, RFT status registers, output mux, and voting logic. The ENABLE_WATCHDOG parameter enables the creation of watchdog-timer logic for each PRR in the design. Additionally, a watchdog configuration register is created for each PRR. For PRMs which do not wish to use the watchdog functionality, the configuration register can be used to disable the timer. NUM_PRRS status registers are created. There are several parameters related to the creation of voting logic within the RFT controller. The ENABLE_TMR and ENABLE_DWC parameters create TMR and DWC voters, respectively. One TMR voter is created for every three PRRs, and one DWC voter is created for every two PRRs. By creating only ⌊NUM_PRRS/3⌋ or ⌊NUM_PRRS/2⌋ voters, respectively, we reduce the total number of required resources at the expense of some flexibility. Based on the parameterized voters, the output multiplexer can then select from the complete enumeration of the possible RFT-mode combinations. These simple parameters allow for the creation of a scalable RFT controller design. By enabling or disabling specific fault-tolerance features, the RFT controller can be targeted and optimized for specific architectures or applications. As system size increases, the parameterized design automatically creates the logic required for additional PRRs.

5.2 Task Scheduling for RC Systems in Dynamic Fault-Rate Environments

In order to optimize a system for performance and reliability, the fault-tolerant mode must be selected carefully based upon the current fault conditions experienced by the system. In this section, we propose a fault-tolerant scheduler that can schedule real-time tasks and use a heuristic to select a fault-tolerant RFT mode for each task. We then evaluate the effectiveness of the proposed heuristic using two case-study simulations.

5.2.1 Selection Criteria for Fault-Tolerant Mode

RFT mode switching can be triggered by a priori knowledge of the operating environment, application-triggered events, or external events. In an RFT system, the expected fault rate can be estimated either directly or indirectly. An external radiation sensor can be directly interfaced with the FPGA, allowing the system to track the current fault rate and predict future fault rates. Alternatively, the RFT system can indirectly determine fault rates using models of the expected fault environment from Chapter 3. By correlating the space system's current position to an existing model, a fault-rate estimate can be used to make scheduling decisions. In the following sections we assume that tasks can be scheduled with no fault tolerance (Simplex), duplication with compare (DWC), or triple-modular redundancy (TMR). Data errors in tasks are detected by comparing or voting on the output of each task replica. We also assume that the system can estimate the current fault environment using a pre-existing model of the system's orbit.

FT-mode selection using thresholds

One of the most straightforward methods for selecting an appropriate fault-tolerance mode is the use of thresholds. At very high fault rates, TMR is required to maintain reliability. At low fault rates, DWC or Simplex modes may provide sufficient reliability while increasing performance. At each time step, the current fault rate is measured, and new tasks are assigned based on the pre-selected rules. The optimal threshold occurs at the fault rate where the reliable performance of TMR and DWC are equal. The ideal scheduling heuristic will select TMR when the current fault rate is above the fault-rate threshold, f_thresh, and will select DWC otherwise. Selecting the appropriate value for the threshold is dependent on the application and environment.
Determining f_thresh requires information about fault rates, task frequency, task load, and other factors that may not be static throughout a system's operation.
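The threshold rule above reduces to a one-line decision. The following minimal sketch assumes a user-supplied f_thresh; any numeric value shown in use is a placeholder, not a measured threshold.

```python
def select_mode_threshold(fault_rate, f_thresh):
    """Threshold-based FT-mode selection: TMR above f_thresh, DWC below."""
    return "TMR" if fault_rate > f_thresh else "DWC"
```

For example, with a placeholder threshold of 1e-4 faults/sec, a measured rate of 1e-3 selects TMR, while 1e-5 selects DWC.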

Time-resource metric for FT-mode selection

Instead of depending upon a user-defined threshold value to determine a fault-tolerance strategy, we explore a metric that can estimate the optimal threshold. The metric combines computation time (τ) and the fault probability of a single task (f) in order to select between FT modes. By developing a scheduling metric that incorporates a dynamic fault rate, we intend to improve overall system performance without the need for expert user input. Given the current fault rate, f, the reliability of a task at the end of its computation time, R, is given by the following equations (depending on mode):

R_Simplex = (1 − f)^τ
R_DWC = (1 − f)^(2τ)                                                 (5-1)
R_TMR = 3(1 − f)^(2τ) − 2(1 − f)^(3τ)

If tasks are running in DWC or TMR mode, faults are discovered at the end of the task's computation time. Faulty tasks are then rescheduled until they complete successfully. The average execution time required for a task to complete successfully is then given by the following geometric series:

τ_e = Σ_{n=0}^{∞} τ(1 − R)^n = τ / R                                 (5-2)

We define a time-resource coefficient, α, which combines the effective execution time with the required resources of a given task (α = N · τ_e, where N is the number of PRRs the task occupies). Then, by comparing the α of a DWC or TMR task, we can determine which mode is optimal for reliable performance (lower is better). In Equation 5-3, we solve for the conditions where DWC will provide lower α than TMR:

2τ / (1 − f)^(2τ) ≤ 3τ / [3(1 − f)^(2τ) − 2(1 − f)^(3τ)]            (5-3)

Simplifying Equation 5-3 provides the following simple relation:

(1 − f)^τ ≥ 3/4                                                      (5-4)

Based on the definition of α, DWC provides more reliable performance than TMR when R_Simplex is greater than 3/4. For low fault rates, DWC provides higher overall performance. For very high fault rates, or very long execution times, the reliability of TMR scheduling is preferred. A similar analysis can be performed for Simplex tasks; however, Simplex scheduling has no method for detecting faults. We use this conclusion as the basis for an adaptive fault-tolerance threshold.

5.2.2 Scheduler for RFT

Traditional fault-tolerant scheduling algorithms assume that the fault rate experienced by the system will be constant and that the fault-tolerance strategy will also be constant. For an RFT-based system, a scheduler that uses the current fault rate is necessary to maximize system utilization while maintaining system availability. The fault-tolerant scheduler presented in this section can schedule tasks in any FT mode based on user-defined thresholds or the α-metric.

RFT architecture description

The RFT system described in Chapter 3 contains a microprocessor connected to several large partially-reconfigurable regions (PRRs) through a shared system bus. During normal system operation, unique tasks can be scheduled to any of the PRRs. Depending on system configuration, the outputs of three contiguous PRRs can be voted on to provide coarse-grained TMR functionality, or two PRRs can provide DWC functionality. Each of these PRRs is large and identical in size, and can be represented with a 1D area model, reducing many of the scheduling problems presented in Section
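The derivation in Equations 5-1 through 5-4 can be checked numerically. The following Python sketch (illustrative only; the dissertation's simulator was written in C++) computes R and α for each mode and confirms that the α comparison is equivalent to the (1 − f)^τ ≥ 3/4 rule.

```python
def reliability(mode, f, tau):
    """Task reliability at the end of its computation time (Eq. 5-1)."""
    x = (1.0 - f) ** tau
    if mode == "Simplex":
        return x
    if mode == "DWC":
        return x ** 2
    if mode == "TMR":
        return 3 * x ** 2 - 2 * x ** 3
    raise ValueError(mode)

def alpha(mode, f, tau):
    """Time-resource coefficient: alpha = N * tau_e = N * tau / R (Eq. 5-2)."""
    regions = {"Simplex": 1, "DWC": 2, "TMR": 3}[mode]
    return regions * tau / reliability(mode, f, tau)

def select_mode_alpha(f, tau):
    """Pick the FT mode with the lower alpha; equivalent to Eq. 5-4."""
    return "DWC" if alpha("DWC", f, tau) <= alpha("TMR", f, tau) else "TMR"
```

For a task with τ = 10 time steps, the crossover sits where (1 − f)^10 = 3/4, i.e. f ≈ 0.028; at f = 0.02 the metric selects DWC, and at f = 0.05 it selects TMR.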

5.2.3 Software Simulation

In order to evaluate our scheduling technique and possible heuristics, a software-based discrete-time simulator was developed in C++. The simulator enables us to specify task arrival rates, task deadlines, dynamic fault rates, and scheduling algorithms. In addition to scheduling tasks, the simulator can also inject faults into tasks and force re-scheduling of failed tasks. Figure 5-3 shows a basic overview of how the simulator is used. At each time step, tasks are randomly added to a task pool. This process is modeled as a Poisson process with mean λ_arrival. All tasks in the task pool are scheduled, if possible, and then moved to the reservation list. When multiple tasks arrive simultaneously, tasks are scheduled using an earliest-deadline-first (EDF) heuristic. The scheduler does not employ preemption; tasks are scheduled on arrival only, and new tasks must be placed around the existing schedule. Tasks that cannot be scheduled before their deadlines are rejected. If a task is scheduled to begin at the current time step, the simulator moves the task from the reserved list to the execution list. After tasks have been scheduled, faults are injected into each PRR with probability f in order to simulate the dynamic fault environment. At the end of a task's execution, the outcome of the task is determined based upon the number of faults encountered (i.e., Simplex and DWC tasks fail with one faulty PRR; TMR tasks fail with faults in two or more PRRs). Multiple faults within a single PRR have no additional effect on the system. If the task fails, the scheduler returns the task to the task queue to be re-scheduled. All other reserved tasks (scheduled, but not yet executing) are also returned to the task queue to be rescheduled. Fault-tolerant tasks are rescheduled until they successfully complete or can no longer meet their deadline.
When tasks must be re-scheduled due to faults, they are treated as new tasks for scheduling purposes, although their original deadlines are maintained. The fault-tolerant mode for rescheduled tasks is based on the fault rate at the time of rescheduling.
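The simulator's task-outcome rule described above is simple enough to state as code. The following is a hypothetical Python sketch of that rule, not the simulator's actual implementation: a Simplex or DWC task fails if any of its PRRs sees a fault, while a TMR task fails only when two or more of its PRRs are faulty, and repeated faults within one PRR count once.

```python
REGIONS = {"Simplex": 1, "DWC": 2, "TMR": 3}

def task_fails(mode, faulty_prrs):
    """Return True if the task's output is erroneous and it must be rescheduled.

    faulty_prrs -- number of the task's PRRs that saw at least one fault.
    """
    if faulty_prrs > REGIONS[mode]:
        raise ValueError("more faulty PRRs than the mode occupies")
    if mode == "TMR":
        return faulty_prrs >= 2   # the voter masks a single faulty replica
    return faulty_prrs >= 1       # Simplex/DWC cannot mask any fault
```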

5.2.4 Analysis and Results

In the following analysis, task execution times (t_exec) are uniformly distributed in [10, 100] time steps, with deadlines (t_deadline) in [100, 200] time steps. For simplicity, we assume that a time step is 1 second. Simplex tasks use one processing region, DWC tasks use two processing regions, and TMR tasks use three processing regions. The simulated system uses 12 processing regions (N_PRRs) in order to enable flexibility in placing TMR and DWC tasks. For each experiment, we measure the performance of each metric with the scheduler's guarantee ratio (percentage of total tasks scheduled successfully) while attempting to schedule 100,000 tasks.

Constant fault rates

Initially, the simulator is used to establish a fault-free baseline for comparison purposes. Figure 5-4 shows the effect of arrival rate on the performance of the system. At low arrival rates, Simplex, DWC, and TMR scheduling can all meet the system demand. However, arrival rates higher than 0.06 tasks per second begin to impact the schedulability of the TMR system because there are not enough resources to handle all incoming tasks. Using DWC for fault tolerance results in higher guarantee ratios, since more DWC tasks can be scheduled at any one time. The lack of a fault-tolerance mechanism excludes the use of Simplex scheduling in the presence of faults. In order to investigate the effect of fault rates on our system, we chose a constant arrival rate of 0.075 tasks per second. With this arrival rate, the DWC system can successfully schedule all incoming tasks, while the TMR system cannot. Using the arrival rate in this way, we define a system that requires DWC to meet performance demands but can temporarily use TMR to meet reliability constraints. The effect on the guarantee ratio is shown in Figure 5-5. At low fault rates the DWC mode provides higher throughput, while TMR outperforms DWC at high fault rates.
Our adaptive metric produces high throughput at low fault rates, closely tracking the performance of DWC, but performs between TMR and DWC at intermediate fault rates.

At higher fault rates, the guarantee ratio of the adaptive heuristic produces results close to TMR. From these constant-fault-rate results, an ideal-threshold heuristic can be determined; the threshold is the crossover fault rate at which TMR and DWC perform equally.

Dynamic fault-rate case studies

In order to benefit from the adaptive scheduling methods, the fault rates experienced by the system must vary. We present two fault profiles that represent patterns commonly seen in space missions. Figure 5-6 shows the fault profiles used for the following analysis, based on the fault model in [Jacobs et al. 2012]. The first profile is a sinusoidal pattern with a 90-minute period, which is characteristic of fault rates in Low-Earth Orbit (LEO). The second pattern (Burst) represents Highly-Elliptical Orbits (HEO), where the system experiences low fault rates for most of the orbit, with a large burst during the closest approach to Earth, once every 12 hours. For the following case studies, four different scheduling heuristics are examined. The TMR-only and DWC-only heuristics schedule every task in their respective mode. The ideal-threshold heuristic uses the crossover fault rate measured in the previous section to choose between the DWC and TMR modes. The adaptive heuristic uses Equation 5-4 to determine the FT mode for each task. Each heuristic is evaluated using an arrival rate of 0.075 tasks/sec and the same parameters used above. For the Sinusoidal case study, fault rates are low compared to the average task execution time. For this fault profile, scheduling tasks with the DWC-only, ideal-threshold, or adaptive heuristics provides an equivalent rejection ratio of 0.2%. There are enough system resources to make re-computation of failed DWC tasks better than simply using TMR to protect against all failures. The fault rate rarely gets high enough for the threshold or adaptive heuristics to schedule tasks in TMR mode.
The adaptive heuristic performs well, reducing the number of rejected tasks by 94% relative to the TMR-only strategy while maintaining a low average task latency of 8 seconds per task.
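The two fault profiles just described have simple shapes that can be sketched directly. The sketch below assumes Python; the amplitudes, baseline rates, and burst width are placeholders for illustration, not the modeled rates from [Jacobs et al. 2012].

```python
import math

LEO_PERIOD_S = 90 * 60          # 90-minute orbital period (Sinusoidal profile)
HEO_PERIOD_S = 12 * 60 * 60     # closest approach every 12 hours (Burst profile)

def sinusoidal_profile(t, base=1e-4, amplitude=9e-5):
    """LEO-like fault rate oscillating with a 90-minute period."""
    return base + amplitude * math.sin(2 * math.pi * t / LEO_PERIOD_S)

def burst_profile(t, quiet=1e-6, burst=1e-2, burst_width_s=600):
    """HEO-like profile: quiet except during each 12-hour perigee pass."""
    phase = t % HEO_PERIOD_S
    return burst if phase < burst_width_s else quiet
```

Feeding either profile into a time-stepped scheduler reproduces the two regimes discussed above: the sinusoid rarely crosses a TMR-worthy rate, while the burst pattern cleanly separates a quiet DWC phase from a short TMR phase.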

In the Burst case study, fault rates are low except during a short window of time with extremely high fault rates. Unlike in the previous case study, the adaptive heuristics benefit from the large range of fault rates, and each heuristic has different performance characteristics. For this fault-rate profile, the adaptive heuristic performs the best among the static strategies, with 11% fewer rejected tasks than the DWC-only strategy and 48% fewer than the TMR-only strategy. Additionally, the adaptive heuristic has only 3% more rejected tasks than the ideal-threshold heuristic. TMR is only optimal while the high-fault-rate burst occurs; otherwise, DWC better utilizes the system resources. The Burst fault profile is ideal for all dynamic metrics, since the two phases are highly separated.

Scheduling improvements

At extreme fault rates (high or low), the adaptive heuristic will schedule all incoming tasks in the same mode. However, for moderate fault rates, both DWC and TMR tasks will be scheduled, depending on task computation time. One drawback of the dynamic fault-tolerance selection metrics is the resource fragmentation that occurs when different-sized objects are placed on the FPGA fabric. This effect can produce schedules similar to Figure 5-7, where fragmentation causes poor utilization of the available FPGA resources. As tasks arrive, they are placed in PRRs p0 through p5 in either DWC or TMR mode. At time t6 there are enough unused PRRs in the system for a DWC task, but because the resources are not contiguous, the task cannot be placed. Unfortunately, the simplistic EDF scheduling heuristic currently in use does not account for FPGA fragmentation. The lower effectiveness of the adaptive metric at intermediate fault rates can be explained by this FPGA fragmentation. By using a placement-aware scheduler, the adaptive heuristic should perform closer to optimal for all fault rates.
For example, delaying the placement of task F_DWC until t6 would enable a more compact placement in PRRs p2 and p3, leaving space for the placement of an additional DWC task. Alternatively, task F_DWC could be placed in PRRs p4 and p5 at time t5, leaving room for future DWC tasks in PRRs p2 and p3. An FPGA placement-aware scheduling
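The contiguity constraint behind this fragmentation can be expressed as a small search for a run of adjacent free PRRs; free-but-scattered regions cannot host a DWC or TMR task. This is a hypothetical helper, not the dissertation's scheduler.

```python
def find_contiguous_prrs(busy, needed):
    """Return the first index of `needed` adjacent free PRRs, or -1 if none.

    busy -- list of booleans, True if that PRR is currently occupied.
    """
    run_start, run_len = 0, 0
    for i, occupied in enumerate(busy):
        if occupied:
            # a busy PRR breaks the run; restart after it
            run_start, run_len = i + 1, 0
        else:
            run_len += 1
            if run_len == needed:
                return run_start
    return -1
```

With busy = [False, True, False, False], a DWC task (needed = 2) fits starting at PRR 2, but a TMR task (needed = 3) cannot be placed even though three PRRs are free in total, which is exactly the fragmentation effect shown in Figure 5-7.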

algorithm such as the Horizon or Stuffing scheduler [Steiger et al. 2004] should be incorporated in order to improve the performance of the adaptive heuristic. Fault-rate lag is another possible problem. If a task is scheduled in a specific mode but does not execute for a long period of time, a different FT mode may become more appropriate. Limiting the scheduling window to only schedule a few tasks at a time may prevent this lag. Alternatively, using a prediction of the future fault rate during scheduling may reduce this effect. Finally, preemption enables the scheduler to return a currently running task to the task queue in order to start a higher-priority task. By adding preemption capabilities to the scheduler, the guarantee ratio of all the tested metrics can be improved.

5.3 Conclusions

In Phase 3 we have adapted the original bus-based RFT hardware architecture for use with a PR architecture based on point-to-point connections (VAPRES) and quantified the design tradeoffs. Additionally, by providing bus-based and point-to-point hardware templates, the RFT architecture can now be more easily ported to future systems. The additional parameterization of internal RFT components allows for user customization of RFT features such as watchdog timers or voting-logic configurations. We have also presented a novel metric for determining optimal fault-tolerance settings for reconfigurable fault-tolerant systems. An RFT scheduler and simulator were developed in order to test the effectiveness of the adaptive scheduling heuristic and to compare its performance to traditional static fault-tolerance strategies. When using our adaptive FT strategy in Burst-like fault environments, we maintain system reliability while reducing the number of rejected tasks by 48% compared to a static TMR fault-tolerance strategy and 11% compared to static DWC.
In the Sinusoidal case study, the static DWC, ideal-threshold, and adaptive heuristics each reduce the number of rejected tasks by 94% compared to the static TMR strategy. We have demonstrated that the adaptive heuristic performs similarly to

an optimal user-defined threshold, without the need for detailed system simulation and measurement.

Table 5-1. RFT controller resource usage. Columns: module name, slices used, BRAMs used. Rows: PLB Controller; FSL Controller.

Table 5-2. Dynamic scheduling results. Columns: case study, FT metric, guarantee ratio, reject ratio, average latency (s). Rows: Sinusoidal and Burst case studies, each under the TMR-Only, DWC-Only, Ideal, and Adaptive metrics.

Figure 5-1. Research areas for Phase 3.

Figure 5-2. Comparison of RFT architectures. A) PLB-based architecture. B) FSL-based architecture. (Each architecture comprises a MicroBlaze microprocessor, RFT controller with TMR components, memory controller, I/O ports (UART, USB), ICAP, and PRRs 1 through N.)

Figure 5-3. Flowchart of scheduling simulator.

Figure 5-4. Effect of arrival rate on fault-free operation (N_PRRs = 12, t_exec ∈ [10, 100], t_deadline ∈ [100, 200]).

Figure 5-5. Effect of fault rate on task rejection (N_PRRs = 12, t_exec ∈ [10, 100], t_deadline ∈ [100, 200], λ_arrival = 0.075).

Figure 5-6. Fault-rate profile for case studies.

Figure 5-7. Resource fragmentation from adaptive placement.


Simulation of Hamming Coding and Decoding for Microcontroller Radiation Hardening Rehab I. Abdul Rahman, Mazhar B. Tayel Simulation of Hamming Coding and Decoding for Microcontroller Radiation Hardening Rehab I. Abdul Rahman, Mazhar B. Tayel Abstract This paper presents a method of hardening the 8051 micro-controller, able

More information

Mitigation of SCU and MCU effects in SRAM-based FPGAs: placement and routing solutions

Mitigation of SCU and MCU effects in SRAM-based FPGAs: placement and routing solutions Mitigation of SCU and MCU effects in SRAM-based FPGAs: placement and routing solutions Niccolò Battezzati Filomena Decuzzi Luca Sterpone Massimo Violante 1 Goal To provide solutions for increasing the

More information

UNIT 4 INTEGRATED CIRCUIT DESIGN METHODOLOGY E5163

UNIT 4 INTEGRATED CIRCUIT DESIGN METHODOLOGY E5163 UNIT 4 INTEGRATED CIRCUIT DESIGN METHODOLOGY E5163 LEARNING OUTCOMES 4.1 DESIGN METHODOLOGY By the end of this unit, student should be able to: 1. Explain the design methodology for integrated circuit.

More information

Radiation Hardened System Design with Mitigation and Detection in FPGA

Radiation Hardened System Design with Mitigation and Detection in FPGA Master of Science Thesis in Electrical Engineering Department of Electrical Engineering, Linköping University, 2016 Radiation Hardened System Design with Mitigation and Detection in FPGA Hampus Sandberg

More information

OPTIMIZING DYNAMIC LOGIC REALIZATIONS FOR PARTIAL RECONFIGURATION OF FIELD PROGRAMMABLE GATE ARRAYS

OPTIMIZING DYNAMIC LOGIC REALIZATIONS FOR PARTIAL RECONFIGURATION OF FIELD PROGRAMMABLE GATE ARRAYS OPTIMIZING DYNAMIC LOGIC REALIZATIONS FOR PARTIAL RECONFIGURATION OF FIELD PROGRAMMABLE GATE ARRAYS by MATTHEW G. PARRIS B.S. University of Louisville, 2005 A thesis submitted in partial fulfillment of

More information

Issues in Programming Language Design for Embedded RT Systems

Issues in Programming Language Design for Embedded RT Systems CSE 237B Fall 2009 Issues in Programming Language Design for Embedded RT Systems Reliability and Fault Tolerance Exceptions and Exception Handling Rajesh Gupta University of California, San Diego ES Characteristics

More information

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof.

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Transient Fault Detection and Reducing Transient Error Rate Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Steven Swanson Outline Motivation What are transient faults? Hardware Fault Detection

More information

The Xilinx XC6200 chip, the software tools and the board development tools

The Xilinx XC6200 chip, the software tools and the board development tools The Xilinx XC6200 chip, the software tools and the board development tools What is an FPGA? Field Programmable Gate Array Fully programmable alternative to a customized chip Used to implement functions

More information

MultiChipSat: an Innovative Spacecraft Bus Architecture. Alvar Saenz-Otero

MultiChipSat: an Innovative Spacecraft Bus Architecture. Alvar Saenz-Otero MultiChipSat: an Innovative Spacecraft Bus Architecture Alvar Saenz-Otero 29-11-6 Motivation Objectives Architecture Overview Other architectures Hardware architecture Software architecture Challenges

More information

FPGA Provides Speedy Data Compression for Hyperspectral Imagery

FPGA Provides Speedy Data Compression for Hyperspectral Imagery FPGA Provides Speedy Data Compression for Hyperspectral Imagery Engineers implement the Fast Lossless compression algorithm on a Virtex-5 FPGA; this implementation provides the ability to keep up with

More information

INTRODUCTION TO FPGA ARCHITECTURE

INTRODUCTION TO FPGA ARCHITECTURE 3/3/25 INTRODUCTION TO FPGA ARCHITECTURE DIGITAL LOGIC DESIGN (BASIC TECHNIQUES) a b a y 2input Black Box y b Functional Schematic a b y a b y a b y 2 Truth Table (AND) Truth Table (OR) Truth Table (XOR)

More information

FPGA Power Management and Modeling Techniques

FPGA Power Management and Modeling Techniques FPGA Power Management and Modeling Techniques WP-01044-2.0 White Paper This white paper discusses the major challenges associated with accurately predicting power consumption in FPGAs, namely, obtaining

More information

Design Methodologies and Tools. Full-Custom Design

Design Methodologies and Tools. Full-Custom Design Design Methodologies and Tools Design styles Full-custom design Standard-cell design Programmable logic Gate arrays and field-programmable gate arrays (FPGAs) Sea of gates System-on-a-chip (embedded cores)

More information

FPGA VHDL Design Flow AES128 Implementation

FPGA VHDL Design Flow AES128 Implementation Sakinder Ali FPGA VHDL Design Flow AES128 Implementation Field Programmable Gate Array Basic idea: two-dimensional array of logic blocks and flip-flops with a means for the user to configure: 1. The interconnection

More information

A Network-on-Chip for Radiation Tolerant, Multi-core FPGA Systems

A Network-on-Chip for Radiation Tolerant, Multi-core FPGA Systems A Network-on-Chip for Radiation Tolerant, Multi-core FPGA Systems Justin A. Hogan, Raymond J. Weber and Brock J. LaMeres Electrical and Computer Engineering Department Montana State University Bozeman,

More information

ENHANCED DYNAMIC RECONFIGURABLE PROCESSING MODULE FOR FUTURE SPACE APPLICATIONS

ENHANCED DYNAMIC RECONFIGURABLE PROCESSING MODULE FOR FUTURE SPACE APPLICATIONS Enhanced Dynamic Reconfigurable Processing Module for Future Space Applications ENHANCED DYNAMIC RECONFIGURABLE PROCESSING MODULE FOR FUTURE SPACE APPLICATIONS Session: SpaceWire Missions and Applications

More information

Spiral 2-8. Cell Layout

Spiral 2-8. Cell Layout 2-8.1 Spiral 2-8 Cell Layout 2-8.2 Learning Outcomes I understand how a digital circuit is composed of layers of materials forming transistors and wires I understand how each layer is expressed as geometric

More information

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Authors: Robert L Akamine, Robert F. Hodson, Brock J. LaMeres, and Robert E. Ray www.nasa.gov Contents Introduction to the

More information

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck

More information

White Paper Assessing FPGA DSP Benchmarks at 40 nm

White Paper Assessing FPGA DSP Benchmarks at 40 nm White Paper Assessing FPGA DSP Benchmarks at 40 nm Introduction Benchmarking the performance of algorithms, devices, and programming methodologies is a well-worn topic among developers and research of

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Fall 2002 EECS150 - Lec06-FPGA Page 1 Outline What are FPGAs? Why use FPGAs (a short history

More information

Design Methodologies. Full-Custom Design

Design Methodologies. Full-Custom Design Design Methodologies Design styles Full-custom design Standard-cell design Programmable logic Gate arrays and field-programmable gate arrays (FPGAs) Sea of gates System-on-a-chip (embedded cores) Design

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance

More information

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs? EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Outline What are FPGAs? Why use FPGAs (a short history lesson). FPGA variations Internal logic

More information

Introduction to Partial Reconfiguration Methodology

Introduction to Partial Reconfiguration Methodology Methodology This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Define Partial Reconfiguration technology List common applications

More information

Dynamic Reconfigurable Computing Architecture for Aerospace Applications

Dynamic Reconfigurable Computing Architecture for Aerospace Applications Dynamic Reconfigurable Computing Architecture for Aerospace Applications Brock J. LaMeres 406-994-5987 lameres@ece.montana.edu Clint Gauer 406-994-6495 gauer33@gmail.com Electrical & Computer Engineering

More information

Radiation Test Results of the Virtex FPGA and ZBT SRAM for Space Based Reconfigurable Computing

Radiation Test Results of the Virtex FPGA and ZBT SRAM for Space Based Reconfigurable Computing Radiation Test Results of the Virtex FPGA and ZBT SRAM for Space Based Reconfigurable Computing Earl Fuller 2, Phil Blain 1, Michael Caffrey 1, Carl Carmichael 3, Noor Khalsa 1, Anthony Salazar 1 1 Los

More information

About using FPGAs in radiation environments

About using FPGAs in radiation environments About using FPGAs in radiation environments Tullio Grassi (FNAL / Univ. of MD) Motivations We may need some "glue logic" between the Front-End ASICs and the GBT (see talk from Chris Tully HB/HE Strawman

More information

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Hardware Design Environments Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Outline Welcome to COE 405 Digital System Design Design Domains and Levels of Abstractions Synthesis

More information

8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments

8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments 8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments QII51017-9.0.0 Introduction The Quartus II incremental compilation feature allows you to partition a design, compile partitions

More information

A Low-Latency DMR Architecture with Efficient Recovering Scheme Exploiting Simultaneously Copiable SRAM

A Low-Latency DMR Architecture with Efficient Recovering Scheme Exploiting Simultaneously Copiable SRAM A Low-Latency DMR Architecture with Efficient Recovering Scheme Exploiting Simultaneously Copiable SRAM Go Matsukawa 1, Yohei Nakata 1, Yuta Kimi 1, Yasuo Sugure 2, Masafumi Shimozawa 3, Shigeru Oho 4,

More information

FPGA Based Digital Design Using Verilog HDL

FPGA Based Digital Design Using Verilog HDL FPGA Based Digital Design Using Course Designed by: IRFAN FAISAL MIR ( Verilog / FPGA Designer ) irfanfaisalmir@yahoo.com * Organized by Electronics Division Integrated Circuits Uses for digital IC technology

More information

Study on FPGA SEU Mitigation for Readout Electronics of DAMPE BGO Calorimeter

Study on FPGA SEU Mitigation for Readout Electronics of DAMPE BGO Calorimeter Study on FPGA SEU Mitigation for Readout Electronics of AMPE BGO Calorimeter Zhongtao Shen, Changqing Feng, Shanshan Gao, eliang Zhang, i Jiang, Shubin Liu, i An Abstract The BGO calorimeter, which provides

More information

FPGA Programming Technology

FPGA Programming Technology FPGA Programming Technology Static RAM: This Xilinx SRAM configuration cell is constructed from two cross-coupled inverters and uses a standard CMOS process. The configuration cell drives the gates of

More information

Using Duplication with Compare for On-line Error Detection in FPGA-based Designs

Using Duplication with Compare for On-line Error Detection in FPGA-based Designs Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2006-12-06 Using Duplication with Compare for On-line Error Detection in FPGA-based Designs Daniel L. McMurtrey Brigham Young University

More information

All MSEE students are required to take the following two core courses: Linear systems Probability and Random Processes

All MSEE students are required to take the following two core courses: Linear systems Probability and Random Processes MSEE Curriculum All MSEE students are required to take the following two core courses: 3531-571 Linear systems 3531-507 Probability and Random Processes The course requirements for students majoring in

More information

Xilinx Virtex-5QV Update and Space Roadmap Kangsen Huey Space Marketing Aerospace & Defense 17th, March, 2016

Xilinx Virtex-5QV Update and Space Roadmap Kangsen Huey Space Marketing Aerospace & Defense 17th, March, 2016 Xilinx Virtex-5QV Update and Space Roadmap Kangsen Huey Space Marketing Aerospace & Defense 17th, March, 2016 Status of V5QV Many programs have built hardware and integrated into Spacecrafts awaiting for

More information

Exploiting Unused Spare Columns to Improve Memory ECC

Exploiting Unused Spare Columns to Improve Memory ECC 2009 27th IEEE VLSI Test Symposium Exploiting Unused Spare Columns to Improve Memory ECC Rudrajit Datta and Nur A. Touba Computer Engineering Research Center Department of Electrical and Computer Engineering

More information

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical

More information

Unleashing the Power of Embedded DRAM

Unleashing the Power of Embedded DRAM Copyright 2005 Design And Reuse S.A. All rights reserved. Unleashing the Power of Embedded DRAM by Peter Gillingham, MOSAID Technologies Incorporated Ottawa, Canada Abstract Embedded DRAM technology offers

More information

Ensuring Power Supply Sequencing in Spaceflight Systems

Ensuring Power Supply Sequencing in Spaceflight Systems Ensuring Power Supply Sequencing in Spaceflight Systems Introduction In today's complex multi-rail power systems, power sequencing and fault monitoring are essential to boost the performance and the health

More information

Signal Processing Algorithms into Fixed Point FPGA Hardware Dennis Silage ECE Temple University

Signal Processing Algorithms into Fixed Point FPGA Hardware Dennis Silage ECE Temple University Signal Processing Algorithms into Fixed Point FPGA Hardware Dennis Silage silage@temple.edu ECE Temple University www.temple.edu/scdl Signal Processing Algorithms into Fixed Point FPGA Hardware Motivation

More information

Processor Building Blocks for Space Applications

Processor Building Blocks for Space Applications Processor Building Blocks for Space Applications John Holland, Northrop Grumman Corporation, Linthicum, U.S.A., John.Holland@ngc.com Jeremy W. Horner, Northrop Grumman Corporation, Linthicum, U.S.A., Jeremy.Horner@ngc.com

More information

FPGAs in radiation-harsh environments

FPGAs in radiation-harsh environments FPGAs in radiation-harsh environments 1 Application examples AFDX Used on new commercial aircrafts from Boeing and Airbus Main communication interface Safety level up to DAL-A Mission computers Used on

More information

Simulating SEU Events in EDAC RAM

Simulating SEU Events in EDAC RAM Application Note AC304 Introduction The Actel RTAX-S Field Programmable Gate Array (FPGA) provides embedded user static RAM in addition to single-event-upset (SEU)-enhanced logic, including embedded triple-module

More information

SCS750. Super Computer for Space. Overview of Specifications

SCS750. Super Computer for Space. Overview of Specifications SUPER COMPUTER FOR SPACE TM Super Computer for Space F FLIGHT MODULE Overview of Specifications One board upset every 100 years in a GEO or LEO Orbit Up to 1000X Better Performance Than Current Space Processor

More information

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,

More information

NEPP Independent Single Event Upset Testing of the Microsemi RTG4: Preliminary Data

NEPP Independent Single Event Upset Testing of the Microsemi RTG4: Preliminary Data NEPP Independent Single Event Upset Testing of the Microsemi RTG4: Preliminary Data Melanie Berg, AS&D in support of NASA/GSFC Melanie.D.Berg@NASA.gov Kenneth LaBel, NASA/GSFC Jonathan Pellish, NASA/GSFC

More information

FPGA: What? Why? Marco D. Santambrogio

FPGA: What? Why? Marco D. Santambrogio FPGA: What? Why? Marco D. Santambrogio marco.santambrogio@polimi.it 2 Reconfigurable Hardware Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much

More information

Introduction to reconfigurable systems

Introduction to reconfigurable systems Introduction to reconfigurable systems Reconfigurable system (RS)= any system whose sub-system configurations can be changed or modified after fabrication Reconfigurable computing (RC) is commonly used

More information

Soft Error Detection And Correction For Configurable Memory Of Reconfigurable System

Soft Error Detection And Correction For Configurable Memory Of Reconfigurable System Soft Error Detection And Correction For Configurable Memory Of Reconfigurable System Babu. M, Saranya. S, Preethy. V, Gurumoorthy. J Abstract: The size of integrated Circuits has developed rapidly and

More information