A Low-Latency DMR Architecture with Efficient Recovering Scheme Exploiting Simultaneously Copiable SRAM

Go Matsukawa 1, Yohei Nakata 1, Yuta Kimi 1, Yasuo Sugure 2, Masafumi Shimozawa 3, Shigeru Oho 4, Hiroshi Kawaguchi 1, and Masahiko Yoshimoto 1,5

1 Graduate School of System Informatics, Kobe University, Kobe, Japan
2 Central Research Laboratory, Hitachi, Ltd., Yokohama, Japan
3 Hitachi Solutions, Ltd., Yokohama, Japan
4 Department of Electrical and Electronics Engineering, Nippon Institute of Technology, Minamisaitama, Japan
5 Japan Science and Technology Agency (JST) CREST, Tokyo, Japan

Abstract

This paper presents a novel architecture for a fault-tolerant high-performance system using a checkpoint/restart approach with dual modular redundancy (DMR). The proposed architecture performs low-latency copying with simultaneously copiable SRAM. Furthermore, it uses an instantaneous comparison scheme that offers higher fault coverage than comparison with a cyclic redundancy check (CRC). Evaluation results show that, compared with the conventional checkpoint/restart DMR, the proposed architecture reduces the number of cycles spent on comparison and copying by 97.8% and keeps the cycle overhead as low as 3.28% even when a fault occurs once during task execution.

Keywords: dual modular redundancy; checkpointing; fault tolerance

I. INTRODUCTION

Microprocessors are key components in a wide variety of applications. Processors used in safety-critical systems such as vehicles and social infrastructure must operate with high reliability. Nevertheless, processors are becoming increasingly sensitive to soft errors and power supply noise as technology scales. Scaling also increases threshold voltage (Vth) variation because of random dopant fluctuation, and this variation degrades processor reliability. These factors cause faults in the processor. Consequently, reliable processors must detect faults and recover from a faulty state.
To detect faults, dual modular redundancy (DMR) with a recovery scheme has been applied to reliable processors [1-3]. The two cores of a DMR processor execute the same task in parallel. The processor detects faults by comparing the states of the cores (register values and data stored in memory) at every checkpoint interval. If the states match, then the register values of the cores are copied into backup shadow registers for subsequent recovery. If the states differ, then the cores load the shadow register values to recover the most recent fault-free state. This checkpoint and recovery technique enables detection of faults and recovery from a faulty state. However, copying register values at each checkpoint and loading them for recovery impose additional latency. Unless that latency is low, the technique cannot be applied to systems such as vehicle control systems.

A conventional DMR architecture uses a cyclic redundancy check (CRC) code for comparison and a bus transfer for copying register values [4]. The fault-detection capability of CRC comparison depends on the fault location. The bus transfer is processed sequentially, so its latency increases with the number of registers. Herein, we propose a DMR architecture that has a low-latency recovery scheme. To realize low latency, the proposed DMR architecture exploits block-level simultaneously copiable SRAM. In addition, exploiting block-level instantaneous comparison SRAM improves the fault detection capability.

II. PROPOSED DMR ARCHITECTURE WITH A RECOVERY SCHEME

In this section, we present an overview of the proposed DMR architecture with a recovery scheme. Then we explain the checkpoint/restart scheme. Finally, we explain the differences from conventional DMR.

Figure 1. Proposed DMR architecture with recovery scheme.

Figure 1 portrays the proposed DMR architecture, which consists of two cores, an instantaneous comparison buffer, and a DMR controller.
In normal operation, the proposed architecture performs computation and control using the working memory, store queue, and working registers. Output data are written to the store queue or working registers and, at the same time, to the instantaneous comparison buffer. As shown in Figure 2, normal operation is performed during each checkpoint interval. At each checkpoint, the instantaneous comparison buffer compares the two replicas of the registers and store queue, and the comparison result is transferred to the DMR controller. Figure 3 depicts the data transfer between the store queue and working memory.

If the comparison indicates a match, then the store queue data are written to working memory during the next normal operation, as shown in Figure 3, and the working register data are copied to the shadow registers using simultaneous data copy. If the comparison indicates a mismatch, then the store queue data are erased and the shadow register data are transferred to the working registers. The process then restarts from the last checkpoint.

Conventional copying between the working registers and shadow registers is executed via the shared bus, so its latency rises in proportion to the number of registers. Moreover, conventional comparison of data between the dual cores is also performed via the bus, so its latency increases with the data size. Although CRC comparison is often used to reduce the number of comparison cycles, a CRC might be unable to detect multibit errors.

The proposed architecture is valid for transient faults such as soft errors. Permanent faults, on the other hand, cannot be covered: if a permanent fault occurs, the architecture will repeat fault detection and recovery indefinitely during execution.

Figure 3. Data transfer between the store queue and working memory: comparison match and comparison mismatch.

III. INSTANTANEOUS COMPARISON AND SIMULTANEOUS COPY WITH 7T MEMORY CELL

The proposed DMR architecture implements the instantaneous comparison and simultaneous copy schemes with 7T SRAM.
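The checkpoint/restart flow of Section II can be sketched in software as follows. This is a minimal behavioral model, not the authors' hardware: the dictionary-based core, the toy workload in `execute_interval`, the single shared `memory` dictionary, and the `faulty_attempts` fault-injection parameter are all assumptions made for illustration. For simplicity the sketch commits the store queue immediately on a match, whereas the paper writes it back during the next normal operation.

```python
def new_core(n_regs=4):
    """Hypothetical core model: working registers, shadow registers, store queue."""
    return {"working": [0] * n_regs, "shadow": [0] * n_regs, "store_queue": []}


def execute_interval(core, index, fault=False):
    """Toy workload for one checkpoint interval (assumed, not from the paper)."""
    value = core["working"][0] + index + 1
    if fault:
        value ^= 0b1000  # transient bit flip, the assumed fault model
    core["working"][0] = value
    core["store_queue"].append((index, value))  # pending write (address, value)


def run_task(n_intervals, faulty_attempts=frozenset()):
    """Run a task under DMR with checkpoint/restart.

    faulty_attempts holds (interval, attempt) pairs in which the second core
    suffers a transient fault. Returns (memory, intervals_executed).
    """
    memory = {}
    cores = [new_core(), new_core()]
    executed = 0
    for i in range(n_intervals):
        attempt = 0
        while True:  # a permanent fault would keep this loop running forever
            execute_interval(cores[0], i)
            execute_interval(cores[1], i, fault=(i, attempt) in faulty_attempts)
            executed += 1
            a, b = cores
            match = (a["working"] == b["working"]
                     and a["store_queue"] == b["store_queue"])
            if match:
                # Commit: store queue -> working memory, working -> shadow.
                for addr, val in a["store_queue"]:
                    memory[addr] = val
                for c in cores:
                    c["store_queue"].clear()
                    c["shadow"] = list(c["working"])
                break
            # Mismatch: erase the store queue, restore working from shadow,
            # then re-execute from the last checkpoint.
            for c in cores:
                c["store_queue"].clear()
                c["working"] = list(c["shadow"])
            attempt += 1
    return memory, executed


clean_memory, clean_runs = run_task(4)
faulty_memory, faulty_runs = run_task(4, {(2, 0)})
print(clean_memory == faulty_memory, clean_runs, faulty_runs)
```

With one transient fault injected in the third interval, the committed memory contents are the same as in the fault-free run, at the cost of one re-executed interval.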
This section explains the instantaneous comparison and the simultaneous copy scheme.

Figure 2. Execution with checkpoint and recovery.

A. 7T SRAM

Figure 4. Schematic of 7T bitcell pair.

Figure 4 presents a schematic of a pair of 7T SRAM bitcells, on which the simultaneous copy scheme and the instantaneous comparison mechanism are based [5]. The internal nodes of the two 7T bitcells are connected to each other through pMOS transistors. The 7T bitcell is 11% larger in area than the conventional 6T bitcell. When one bit of data is stored in the two bitcells, the 7T SRAM improves reliability because the write margin and read margin of the SRAM are improved.

B. Instantaneous comparison

Figure 5. Instantaneous comparison structure: 7T bitcell pair capable of instantaneous comparison, and deployment of the instantaneous comparison buffer.

Figure 5 shows a schematic of the 7T SRAM, which can achieve high-speed comparison. To compare data, the connecting pMOS transistors are turned on by lowering the CTRL signal [6]. If a 7T bitcell pair retains different data, then supply current flows to ground. However, if the pair retains the same data, then no current path exists and no supply current flows through the bitcell pair. Therefore, the comparison result is determined by whether the supply voltage remains high. Moreover, the comparison is made instantly by deploying a replica of the registers and store queue, as shown in Figure 5. An instantaneous comparison takes four cycles, independent of the size of the comparison buffer. In contrast, the cycle time of a normal comparison via the shared bus becomes longer as the number of registers increases. Although comparison with a CRC can be accomplished in two cycles, the proposed comparison scheme has higher fault coverage. Furthermore, the instantaneous comparison scheme consumes 99.79% less energy than a CRC comparison circuit [6]. A drawback of this comparison scheme is that the area overhead grows with the size of the instantaneous comparison buffer.

C. Simultaneous copy scheme

Figure 6. Simultaneous copy scheme.

The simultaneous copy scheme is extremely effective for reducing the latency of the checkpoint and recovery states. Figure 6 depicts the working register values being backed up into the shadow registers at the block level using the simultaneous copy scheme. Figure 7 depicts a schematic of the simultaneously copiable 7T SRAM, which copies data between bitcells [7]. The internal nodes of the 7T bitcells are connected to each other through pMOS transistors in the same way as in the 7T SRAM explained briefly in Section III.B. When a datum in bitcell A is copied to bitcell B (upper to lower), the datum in bitcell B is first destroyed to prepare for copying. Then the datum in bitcell A is copied to bitcell B through the connecting pMOS transistors. All 7T bitcell pairs can back up or restore the checkpoint data simultaneously without a shared bus. Consequently, this 7T SRAM achieves a high-speed copy that requires only four cycles.

Figure 7. Schematic of simultaneously copiable SRAM.
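The difference in fault coverage between CRC comparison and bitwise block-level comparison noted in Section III.B can be illustrated with a small sketch. The paper does not state which CRC the conventional architecture uses, so the CRC-8 polynomial (x^8 + x^2 + x + 1), the zero initial value, and the sample data below are assumptions chosen only to show the effect. The function `block_mismatch` is a hypothetical software stand-in for the 7T comparison, in which any differing bitcell pair opens a current path.

```python
def crc8(data, poly=0x07):
    """Bitwise CRC-8 with zero initial value (assumed polynomial 0x107)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def block_mismatch(block_a, block_b):
    """Model of block-level comparison: one differing bit anywhere is enough."""
    return any(a ^ b for a, b in zip(block_a, block_b))


core_a = bytes([0x12, 0x34, 0x56, 0x78])
# Four-bit error whose pattern equals the generator polynomial (0x01, 0x07),
# spread over two adjacent bytes. A CRC is blind to such patterns.
error = bytes([0x00, 0x01, 0x07, 0x00])
core_b = bytes(a ^ e for a, e in zip(core_a, error))

print("CRC values equal:", crc8(core_a) == crc8(core_b))
print("Bitwise mismatch detected:", block_mismatch(core_a, core_b))
```

Because a CRC is linear, any error pattern that is a multiple of the generator polynomial leaves the checksum unchanged, which is one way a multibit error can escape CRC comparison while still being caught by a bit-by-bit comparison.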

IV. RESULTS

Figure 8. Measured copy cycle, where the number of working registers is 360.

Figure 8 presents the evaluated cycle time in the DMR phase, estimated for 360 working registers. The conventional DMR compares CRCs, which takes two cycles, whereas instantaneous comparison requires four cycles. However, copying the registers via a shared bus requires 360 cycles, whereas the simultaneous copy scheme with 7T SRAM takes four cycles. The conventional DMR architecture therefore needs 362 cycles in the DMR phase. In contrast, the proposed architecture requires eight cycles for comparison and copy, which reduces the clock cycles by 97.8% compared with the conventional approach that copies via the shared bus.

Figure 9. Cycle overhead with respect to the checkpoint interval: (a) fault-free and (b) a fault occurring once.

In Figures 9(a) and 9(b), the y-axis shows the cycle overhead of the checkpoint/restart approach, and the x-axis shows the checkpoint interval, which is the number of cycles between two consecutive checkpoints. The task execution time is assumed to be 50 μs, with a cycle time of 16 ns. Figure 9(a) shows the cycle overhead when the task executes fault-free. Figure 9(b) shows the cycle overhead when a fault occurs once during task execution.

In Figure 9(a), as the checkpoint interval decreases, the cycle overhead of the conventional architecture increases rapidly because comparison and copy are performed frequently. In the proposed architecture, the cycle overhead increases only slightly when the checkpoint interval is short because the proposed architecture dramatically reduces the cycle time of the copy operation. When the checkpoint interval is long, the cycle overheads of both the conventional and the proposed architectures are low. However, if a failure occurs in the working registers and the comparison results in a mismatch, then the DMR architecture must roll back to the last checkpoint and re-execute the checkpoint interval.

Figure 9(b) shows that the cycle overhead increases as the checkpoint interval becomes longer because the re-execution time is proportional to the checkpoint interval. The minimum cycle overheads of the conventional DMR and the proposed DMR are 22.6% and 3.28%, respectively, in this case. Moreover, the difference in cycle overhead increases in an environment in which faults often occur. Consequently, the proposed architecture can maintain low latency by setting a short checkpoint interval.

V. CONCLUSION

We proposed a DMR architecture that combines instantaneous comparison with a simultaneous copy scheme using 7T SRAM. The proposed DMR architecture achieves high-speed copying with the simultaneous copy scheme and high fault coverage with instantaneous comparison. It can execute operations with low latency even when the checkpoint interval is short. Consequently, the proposed DMR architecture provides benefits for real-time systems. When adopting the proposed architecture, the area overhead of the 7T SRAM and the instantaneous comparison buffer must be considered.
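The cycle counts reported in Section IV can be checked with a few lines of arithmetic, and the trade-off against the checkpoint interval can be explored with a simple model. The constants come from the text. The `overhead` function is an assumed, simplified model (one comparison and copy per checkpoint, plus one re-executed interval per fault). It is not the authors' evaluation model, so it illustrates only the qualitative trend and its absolute values should not be compared with Figure 9.

```python
import math

N_REGS = 360             # working registers (from the evaluation)
CRC_COMPARE = 2          # cycles, conventional CRC comparison
BUS_COPY = N_REGS        # cycles, as reported for copying via the shared bus
INSTANT_COMPARE = 4      # cycles, instantaneous comparison
SIMULTANEOUS_COPY = 4    # cycles, simultaneous copy

conventional = CRC_COMPARE + BUS_COPY            # DMR phase, conventional
proposed = INSTANT_COMPARE + SIMULTANEOUS_COPY   # DMR phase, proposed
reduction = 100 * (conventional - proposed) / conventional

TASK_NS = 50_000         # 50 us task execution time
CYCLE_NS = 16            # 16 ns cycle time
TASK_CYCLES = TASK_NS // CYCLE_NS


def overhead(interval, dmr_cycles, faults=0):
    """Assumed model: cycle overhead in percent for a checkpoint interval."""
    checkpoints = math.ceil(TASK_CYCLES / interval)
    # Each fault is assumed to cost one re-executed interval plus one
    # additional comparison-and-copy phase.
    extra = checkpoints * dmr_cycles + faults * (interval + dmr_cycles)
    return 100 * extra / TASK_CYCLES


print(conventional, proposed, round(reduction, 1), TASK_CYCLES)
for interval in (50, 200, 1000):
    print(interval,
          round(overhead(interval, conventional, faults=1), 1),
          round(overhead(interval, proposed, faults=1), 1))
```

Under this model, short intervals are dominated by the per-checkpoint cost and long intervals by the re-execution cost, which matches the shape of the trade-off described in Section IV.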

REFERENCES

[1] J. Teifel, "Self-voting dual-modular-redundancy circuits for single-event-transient mitigation," IEEE Transactions on Nuclear Science, vol. 55, pp. 3435-3439, 2008.
[2] A. Ziv and J. Bruck, "Performance Optimization of Check-pointing Schemes with Task Duplication," IEEE Transactions on Computers, pp. 1381-1386, 1997.
[3] Y. Zhang and K. Chakrabarty, "Energy-aware adaptive checkpointing in embedded real-time systems," DATE, vol. 1, pp. 918-923, 2003.
[4] J. C. Smolens, B. T. Gold, B. Falsafi, and J. C. Hoe, "Reunion: Complexity-Effective Multicore Redundancy," MICRO, pp. 223-234, 2006.
[5] H. Fujiwara, S. Okumura, Y. Iguchi, H. Noguchi, H. Kawaguchi, and M. Yoshimoto, "A 7T/14T Dependable SRAM and its Array Structure to Avoid Half Selection," VLSI Design 2009, pp. 295-300, January 2009.
[6] S. Okumura, Y. Nakata, K. Yanagida, and Y. Kagiyama, "Low-power block-level instantaneous comparison 7T SRAM for dual modular redundancy," IEEE CICC, 2011.
[7] S. Okumura, S. Yoshimoto, K. Yamaguchi, Y. Nakata, H. Kawaguchi, and M. Yoshimoto, "7T SRAM enabling low-energy simultaneous block copy," IEEE CICC, 2010.