The study of hardware redundancy techniques to provide a fault tolerant system


Cumhuriyet Üniversitesi Fen Fakültesi Fen Bilimleri Dergisi (CFD), Cilt: 36, No: 4 Özel Sayı (2015) ISSN:
Cumhuriyet University Faculty of Science Science Journal (CSJ), Vol. 36, No: 4 Special Issue (2015) ISSN:

Mostafa SADEGHI 1, Hossein SOLTANI 2, Mohamadreza KHAYYAMBASHI 3

1 Department of Computer, Zavareh Branch, Islamic Azad University, Zavareh, Iran
2 Department of Computer, Ashkezar Branch, Islamic Azad University, Ashkezar, Iran
3 Department of Computer Engineering, University of Isfahan, Isfahan, Iran

Received: ; Accepted:

Abstract

The reliability of computer systems can be increased by means of fault tolerance. Fault tolerance in a digital system is achieved through redundancy in hardware, software, or computation. This redundancy can be implemented in a static, dynamic, or hybrid configuration. Hardware redundancy is obtained by providing two or more physical instances of a hardware component. In this paper, we study different hardware redundancy techniques, their efficiency, and their problems.

Keywords: Fault tolerance, Hardware redundancy, TMR structure, reliability, availability.

INTRODUCTION

A system that can continue to perform correctly in the presence of hardware or software faults is called fault tolerant [1]. Today, computer systems are growing more complex: their parts are often loosely integrated, and many factors contribute to the output. It is therefore vital to design systems that do not fail outright when one part has a problem, but maintain correct operation and still reach the final goal, possibly at reduced overall efficiency. Digital systems increasingly perform critical tasks and therefore need higher reliability. Conventional design techniques and high-quality components alone do not reduce the failure probability sufficiently.
This means that systems must be fault tolerant. The most important technique used so far to achieve fault tolerance is redundancy. Definitions of failure, fault, and error are given below. We then discuss hardware faults and their kinds, fault tolerance, the purposes and applications of fault-tolerant design, the components of fault tolerance strategies, the relation between redundancy and fault tolerance, and hardware redundancy and its techniques. Finally, a conclusion of the discussed issues is offered [2].

Failures, Faults and Errors

These three terms have different meanings:

Failure: the inability of a component to perform its predetermined task.

Error: a sign of failure in the system. In this case, the logical value produced by an operation differs from the expected value. A failure in a system does not necessarily lead to an error; an error occurs when there is a critical failure in the system. In other words, an error happens when, for a given input condition, an incorrect output results.

Fault: an abnormal physical condition that occurs because of design errors (such as mistakes in the specification or in configuring the system) or manufacturing problems. Modeling and protecting against failures due to design errors and internal factors is tough, because their effects and outcomes are difficult to anticipate [3,4].

* Corresponding authors. E-mails: msadeghi@khuisf.ac.ir, soltaniyazdi@yahoo.com, m.r.khayyambashi@eng.ui.ac.ir
Special Issue: Technological Advances of Engineering Sciences
Faculty of Science, Cumhuriyet University

Specification of fault

A fault can be classified by its duration, nature, and extent.

The duration of a fault may be transient or permanent. A transient fault is normally the result of a temporary disturbance; it exists for a limited period and is non-recurring. Permanent (hard) faults are conditions of the device that are not corrected by the passage of time. This kind of fault results from component breakdown, physical defects in components, or design failures. A system with an intermittent fault alternates between failed and successful operation [1,5].

The nature of a fault is determined by its behavior in the system. A logical fault produces errors that can be represented as logic values, whereas the errors resulting from an indeterminate fault have no logic equivalent.

The extent of a fault is determined by the region affected by it. Local faults affect individual components, while global faults influence several components. Because of cost limitations, many fault tolerance strategies respond only to single faults; multiple failures require expensive failure models and system-wide methods for fault tolerance.

Hardware faults

Hardware faults are categorized by duration into permanent, transient, and intermittent [6]. A permanent fault remains active until a corrective action is taken. This kind of fault is normally produced by physical defects in the hardware. Permanent faults are detected by online test methods that run alongside the normal operation of the system [7]. A transient fault is active for a short period of time; if such a fault is activated repeatedly, it is called an intermittent fault. Because of their short duration, transient faults are often detected through the errors that result from their propagation. Intermittent faults are usually also called soft faults or glitches [8].
Design philosophy for overcoming errors

Generally speaking, there are three methods to overcome errors and keep the system in its normal condition:

a. Fault avoidance: any technique applied to prevent a fault or error from occurring.
b. Fault masking: any procedure that, after a fault has occurred, at least prevents the system from producing an error.
c. Fault tolerance: the ability of a system to continue operating in spite of faults. This property relates to reliability, successful operation, and the absence of breakdown. A fault-tolerant system must be able to manage faults in hardware or software components, electrical breakdowns, or any other unexpected defects.

The main problem is that as the complexity of a system increases, its reliability decreases unless corrective measures are taken. Another problem is that although designers do their best to clear the system of software errors and hardware faults before it is put into use, this goal is not fully achievable, because some environmental factors are unavoidable and some user mistakes are unpredictable. Therefore, faults may be beyond the designer's control in some circumstances, even when a system is carefully designed and implemented [9].

Applications of fault tolerance

Fault tolerance is necessary in many critical applications in safety, commerce, and intelligence. Safety-critical applications are those in which loss of life or environmental danger must be avoided, such as aircraft control systems, computer-controlled radiotherapy equipment, systems that regulate the human heart, and military radar. Business-critical applications are those that carry out commercial operations, such as a bank's transaction processing system [1].

Purpose of designing fault tolerance

The objective of fault-tolerant design is to increase reliability by making it possible for a system to continue its operation in the presence of some faults. Note that a fault-tolerant system does not necessarily provide high reliability, and higher reliability does not necessarily imply fault tolerance. A main goal for a fault-tolerant system might be that no single fault can cause the system to fail [5].

Components of fault tolerance strategy

Fault tolerance in a system is achieved through redundancy in hardware, software, information, or computation. This redundancy can be implemented in a static, dynamic, or hybrid configuration. A fault tolerance strategy consists of one or more of the following elements:

Masking: dynamic correction of the errors a fault produces.
Detection: detecting an error (the sign of a fault).
Containment: preventing an error from propagating beyond defined boundaries.
Diagnosis: finding the faulty module that is responsible for a detected error.
Repair/Reconfiguration: removing or replacing a faulty component, or providing a mechanism to bypass it.
Recovery: bringing the system from a faulty state back to a state acceptable for continued operation.

When high performance is required and there is no time for offline fault detection and recovery, a static (passive) configuration is designed to mask as many faults as possible. Dynamic redundancy, on the other hand, works by switching in spare modules or rerouting when a fault occurs. In the hybrid method, some faults are masked by the static configuration, while faulty modules are detected and replaced. Hybrid redundancy is desirable for high-reliability applications in which the probability of several faults occurring is high.

Fault tolerance and Redundancy

There are different methods to achieve fault tolerance.
The most common approach is to include a certain amount of redundancy. By definition, redundancy is the provision of functional capabilities beyond those needed for fault-free operation. There are two kinds of redundancy, namely space and time. Space redundancy provides additional items, components, or functions that are not necessary for fault-free operation. It is further classified into hardware, software, and information redundancy, depending on the type of redundant resource added to the system. In time redundancy, a computation or data transfer is repeated and the result is compared with a stored copy of the previous result [1].

Hardware redundancy

Hardware redundancy is obtained by providing two or more physical instances of a hardware component. For example, a system may contain extra processors, memories, buses, or power supplies. Hardware redundancy is often the only available approach to increase the reliability of a system, because other techniques, such as using higher-quality components, are either exhausted or more costly than redundancy [10]. There are three kinds of hardware redundancy: passive, active, and hybrid. The passive method works by masking faults, while active redundancy is used for detection and recovery [4].

Passive redundancy

Passive redundancy masks faults by means of voting. This method hides faults instead of detecting them. Masking a fault ensures that, despite the fault, only correct data are passed to the output of the system. One advantage of passive redundancy is that continuous operation is guaranteed, because a fault in a redundant module is masked immediately, unless the number of faulty modules exceeds what the voter can tolerate [4]. The following paragraphs describe some passive redundancy techniques.

a. TMR technique: The TMR (triple modular redundancy) structure is a fault-tolerant architecture based on three identical modules that perform the same job. The inputs of these modules receive the same data, and their outputs feed a majority voting circuit [1-10]. Hence, the TMR architecture reduces the probability of an error at the primary output of the system. A faulty module produces a wrong value, which is masked by the two fault-free modules. In the simplest TMR structure, the voter is a weak point: if a problem appears in the voter, the whole TMR structure may become faulty. To avoid this problem, the voter can be protected by more robust design techniques [2]. TMR is the most common form of passive redundancy; its basic configuration is shown in Figure 1 [10]. The components are triplicated in order to perform identical calculations in parallel, and the voter is used to determine the correct result. If one of the modules fails, the majority voter masks the fault using the results of the two fault-free modules.

Figure 1. TMR technique (three modules, one per input, feeding a single voter)

A TMR system can mask the fault of only one module; any failure in the remaining modules will cause the voter to produce a wrong result. As long as the other two modules work properly, the TMR system operates correctly as well [9-10]. Assuming that the voter is perfect and that component failures are independent, the reliability of a TMR system is given by:

$$R_{TMR} = R_1 R_2 R_3 + (1 - R_1) R_2 R_3 + R_1 (1 - R_2) R_3 + R_1 R_2 (1 - R_3)$$

The term $R_1 R_2 R_3$ is the probability that all three modules work properly. The term $(1 - R_1) R_2 R_3$ is the probability that the first module fails while the second and third work properly. The term $R_1 (1 - R_2) R_3$ is the probability that the first and third modules work correctly but the second fails. The term $R_1 R_2 (1 - R_3)$ is the probability that the first and second modules work correctly while the third fails.
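The four-term reliability expression above can be checked numerically. The following is a minimal sketch, assuming a perfect voter and independent module failures as stated in the text; the function name `tmr_reliability` is illustrative and not taken from the paper.

```python
def tmr_reliability(r1: float, r2: float, r3: float) -> float:
    """Probability that at least two of three independent modules work.

    Assumes a perfect voter; r1, r2, r3 are the module reliabilities.
    """
    all_work = r1 * r2 * r3
    only_first_fails = (1 - r1) * r2 * r3
    only_second_fails = r1 * (1 - r2) * r3
    only_third_fails = r1 * r2 * (1 - r3)
    return all_work + only_first_fails + only_second_fails + only_third_fails


if __name__ == "__main__":
    # Three modules of reliability 0.9 each: 0.729 + 3 * 0.081
    print(tmr_reliability(0.9, 0.9, 0.9))
```

With three identical modules of reliability 0.9, the sum is 0.729 + 3 × 0.081 = 0.972, so the triplicated system is more reliable than a single module in this case.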
A more accurate estimate of the reliability of a TMR system, which also takes the reliability of the voter into account, is, for three identical modules of reliability $R$ and a voter of reliability $R_v$:

$$R_{TMR} = (3R^2 - 2R^3) R_v$$

The voter is in series with the redundant modules, because if the voter fails, the whole system fails. For the reliability of a TMR system to be much higher than that of a simplex (non-redundant) system, the reliability of the voter must be very high. Fortunately, compared with the redundant components, the voter is a simple unit, so the probability of its failure is much lower [4,10]. Still, in some systems even a single point of failure is unacceptable. A component is called a single point of failure when its failure leads to the failure of the whole system. In this case, more elaborate voting schemes are used. In order not to depend on a single voter, the voter itself is triplicated. Figure 2 shows such a configuration. Decentralized voting avoids the single point of failure, but it requires consensus among all three voters [2,4].
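The effect of an imperfect voter can be illustrated with a short calculation. This is a sketch under stated assumptions: the three modules are identical with reliability `r`, the voter has reliability `r_v` and is in series with them, and failures are independent. The names `tmr_with_voter` and `min_voter_reliability` are hypothetical helpers introduced here. The break-even condition follows from requiring R_v(3R² − 2R³) > R, which gives R_v > 1/(3R − 2R²).

```python
def tmr_with_voter(r: float, r_v: float) -> float:
    """Reliability of TMR with identical modules (r) and a series voter (r_v)."""
    return (3 * r**2 - 2 * r**3) * r_v


def min_voter_reliability(r: float) -> float:
    """Voter reliability above which TMR beats a single module of reliability r.

    Derived from r_v * (3r^2 - 2r^3) > r, valid for 0 < r <= 1.
    A result above 1 means no voter can make TMR worthwhile.
    """
    return 1.0 / (3 * r - 2 * r**2)


if __name__ == "__main__":
    r = 0.9
    print("break-even voter reliability:", min_voter_reliability(r))
    for r_v in (1.0, 0.95, 0.92):
        print(r_v, tmr_with_voter(r, r_v), tmr_with_voter(r, r_v) > r)
```

For modules of reliability 0.9, the break-even voter reliability is 1/1.08, roughly 0.926, so a voter of reliability 0.95 still improves on a single module while one of 0.92 does not.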

Figure 2. TMR with three voters (each module output feeds voters V1, V2, and V3)

Voting can be performed in hardware or software. Hardware voting is usually fast enough to respond within any time constraint. If voting is carried out by software voters, enough time may not be available [10].

Figure 3. Logic diagram of a 3-input voter (inputs x1, x2, x3; output f)

A majority voter with three inputs for digital data is shown in Figure 3. The value of the output F is determined by the majority value of the inputs X1, X2, and X3. Table 1 gives the definition of this voting function.

Table 1. Truth table of the 3-input majority voter
X1  X2  X3  F
0   0   0   0
0   0   1   0
0   1   0   0
0   1   1   1
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1

In principle, the TMR architecture can tolerate one fault; in practice, however, it may tolerate more than one. With two faults, TMR may still operate properly, depending on the nature and location of the faults. If the errors caused by the faults do not reach the same voter input position at the same time, the faults are easily tolerated by the TMR structure [2,10]. For an error not to be tolerated, the two faults must lie in two different modules and each must propagate an error to the same output of its module. Figure 4 shows two examples in which the same input pattern is applied to the three modules.
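The majority function and the two-fault condition described above can be sketched in a few lines. The example below is a minimal illustration, not the authors' circuit: it assumes each module produces a 2-bit output word (bit 0 standing for output O1, bit 1 for output O2), models faults as bit flips via XOR, and uses the standard majority expression f = x1·x2 + x1·x3 + x2·x3 applied bitwise. The name `majority3` is hypothetical.

```python
def majority3(x1: int, x2: int, x3: int) -> int:
    """Bitwise 2-out-of-3 majority: each output bit follows at least two inputs."""
    return (x1 & x2) | (x1 & x3) | (x2 & x3)


CORRECT = 0b10  # fault-free 2-bit module output (bit 0 = O1, bit 1 = O2)

# Case (a): two faults in different modules hit different outputs.
m1 = CORRECT ^ 0b01   # module 1 corrupts O1 only
m2 = CORRECT ^ 0b10   # module 2 corrupts O2 only
m3 = CORRECT          # module 3 is fault-free
voted_a = majority3(m1, m2, m3)

# Case (b): two faults in different modules hit the same output (O2).
m1 = CORRECT ^ 0b11   # module 1 corrupts O1 and O2
m2 = CORRECT ^ 0b10   # module 2 corrupts O2
m3 = CORRECT
voted_b = majority3(m1, m2, m3)

if __name__ == "__main__":
    print("case a tolerated:", voted_a == CORRECT)
    print("case b tolerated:", voted_b == CORRECT)
```

In case (a) every output bit still has two correct copies, so the voted word is correct; in case (b) output O2 is wrong in two modules, so the majority is wrong on that bit.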

Figure 4. Two faults: a) tolerable, b) not tolerable. The voter is omitted.

Each fault is modeled as a stuck-at fault (F1 and F2, respectively). In part (a) of Figure 4, F1 propagates to output O1 in the first module and F2 propagates to output O2 in the second module. For each output, the voter receives two correct values and one incorrect value. Therefore, the TMR output is correct and F1 and F2 are tolerated [3]. In part (b) of Figure 4, F1 propagates to both O1 and O2, while F2 propagates to O2. The voter receives one wrong value for O1 and two wrong values for O2. As a result, the value appearing on the second output of the TMR structure is faulty, and F1 and F2 are not tolerated. We can conclude that two faults are tolerated unless they lie in two different modules and propagate errors to the same output of each module. When more than two faults occur, they can be analyzed by considering all possible pairs of faults [2,10].

b. NMR technique: N-modular redundancy (NMR) is based on the same principle as TMR, but uses N modules instead of three, as shown in Figure 5.

Figure 5. NMR technique (N modules, one per input, feeding a single voter)

N is usually chosen to be odd so that majority voting is always possible. An NMR system can mask ⌊N/2⌋ module faults, that is, (N − 1)/2 faults for odd N [4,10].

Active Redundancy

This type of redundancy is characterized by detecting faults and then taking action to recover from them. There are many techniques for fault detection. One effective method is hardware duplication with a comparator. After a fault has been detected, the system should be recovered quickly and accurately.
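The detection method mentioned above, duplication with a comparator, can be sketched as follows. This is an illustrative model only: modules are represented as Python callables, the fault is a hypothetical stuck-at-0 on the least significant output bit, and the names `duplicate_and_compare`, `healthy`, and `faulty_lsb_stuck_at_0` are assumptions introduced for the example.

```python
from typing import Callable, Tuple


def duplicate_and_compare(
    module_a: Callable[[int], int],
    module_b: Callable[[int], int],
    x: int,
) -> Tuple[int, bool]:
    """Run two copies of a module on the same input and compare the results.

    Returns (output of module_a, error_signal). The comparator can only say
    that the copies disagree; it cannot tell which copy is the faulty one.
    """
    out_a = module_a(x)
    out_b = module_b(x)
    return out_a, out_a != out_b


def healthy(x: int) -> int:
    return x * x


def faulty_lsb_stuck_at_0(x: int) -> int:
    # Hypothetical fault model: least significant output bit stuck at 0.
    return (x * x) & ~1


if __name__ == "__main__":
    print(duplicate_and_compare(healthy, faulty_lsb_stuck_at_0, 5))
    print(duplicate_and_compare(healthy, faulty_lsb_stuck_at_0, 4))
```

For the input 5 the two copies disagree (25 versus 24) and the error signal is raised. For the input 4 the fault-free output already has a 0 in the affected bit, so the fault stays dormant and no error is signalled, which illustrates that a fault does not always produce an error.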

a. Duplication with comparison: The basic form of active redundancy is duplication with comparison, shown in Figure 6. Two identical modules perform identical computations in parallel. The results are compared by means of a comparator. If the results do not match, an error signal is produced. A duplication-with-comparison scheme can detect only one fault. Once the fault has been detected, the system takes no further action to return to an operational state.

Figure 6. Duplication with comparison (two modules feeding a comparator that produces the error signal)

b. Standby sparing technique: This technique is another form of active hardware redundancy. Only one of the N modules is operational and provides the output of the system. The remaining N − 1 modules serve as spares. A spare is a redundant module that is not needed for the normal operation of the system. A switch is a device that monitors the active module. Whenever an error is reported by the fault detection (FD) unit, it switches operation over to a spare [4].

There are two types of standby sparing, namely hot and cold. In hot standby, both the operational and the spare modules are powered on, so a spare can be switched in and used immediately after the operational module fails. In cold standby, the spare modules are powered off until they are needed to replace a faulty module. One disadvantage of cold standby sparing is that it takes time to power up the module, to initialize it, and to redo the computation. One advantage is that cold spares do not consume power. This is particularly important in applications such as satellite systems, in which power consumption is critical. A standby sparing system with N modules can tolerate N − 1 faults. Here, tolerating a fault means that the system detects the fault and then successfully recovers from it so that it can continue to provide its service properly.
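The switching behavior of standby sparing can be sketched as below. This is a simplified model under several assumptions that are not in the paper: modules are callables, a failed module is fail-silent and returns `None`, and the fault detection unit is reduced to a check for that value. The class name `StandbySparing` and its attributes are hypothetical.

```python
from typing import Callable, List, Optional

Module = Callable[[int], Optional[int]]


class StandbySparing:
    """One active module plus spares; switch to the next spare on a detected fault."""

    def __init__(self, modules: List[Module]) -> None:
        self.modules = list(modules)
        self.active = 0  # index of the currently operational module

    @staticmethod
    def fault_detected(output: Optional[int]) -> bool:
        # Stand-in for the fault detection (FD) unit: assumes fail-silent modules.
        return output is None

    def run(self, x: int) -> int:
        while self.active < len(self.modules):
            output = self.modules[self.active](x)
            if not self.fault_detected(output):
                return output
            self.active += 1  # the switch replaces the faulty module with a spare
        raise RuntimeError("fault detected, but no spare is left for recovery")


def working(x: int) -> Optional[int]:
    return x + 1


def failed(x: int) -> Optional[int]:
    return None


if __name__ == "__main__":
    system = StandbySparing([failed, failed, working])  # N = 3, two faults
    print(system.run(1), "served by module index", system.active)
```

With three modules, the sketch recovers from two faults; when all three have failed, the fault is still detected but the call ends with an exception because no spare remains.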
When the Nth fault occurs, it is still detectable; however, the system can no longer recover and return to normal operation.

c. Pair-and-a-spare technique: This technique combines duplication with comparison and standby sparing. The idea is similar to standby sparing, with the difference that two operational modules, instead of one, work in parallel. As in duplication with comparison, the results are compared to detect a difference. If an error signal is received from the comparator, the switch analyzes the error detection report and determines which module's output is faulty. The faulty module is removed and a spare module replaces it [4]. A pair-and-a-spare system with N modules can tolerate N − 1 faults.

d. Watchdog timer: In this technique a timer is used for error detection. The absence of an expected event is taken as evidence that some defect has occurred in the system. The timer must be reset periodically. Any defect that prevents the reset from being performed causes the system to shut down, so that no major damage occurs. The basic assumption in this technique is that a healthy system resets the timer periodically.

HYBRID HARDWARE REDUNDANCY

Hybrid approaches combine the attractive features of both methods described above. The authors describe this as the most common type of hardware redundancy. However, these methods are very costly and are therefore used only in applications in which fault tolerance is really necessary. One of the most important hybrid techniques is NMR with spares, which combines the ideas of NMR and standby sparing. Figure 7 shows this scheme. In each period, the disagreement detector compares the voter output with the output of every single module. If the voter output differs from a module's output, that module is considered faulty and is replaced with a spare module [8].

a. Duplex-triplex architecture: This approach, shown in Figure 8, combines the ideas of duplication with comparison and TMR. Using TMR makes it possible to mask errors, while the duplicated hardware together with the comparators allows errors to be detected, so that a faulty module is excluded from the voting process [8,9].

b. Self-purging redundancy: Another method for hybrid redundancy is self-purging. In this approach, shown in Figure 9, each module has a switch that compares the voter output with the output of the module. If they disagree, that module is removed from the system. Note that in this structure the voter must be a threshold gate [9].

CONCLUSION

The objective of designing a fault-tolerant system is to improve reliability by making it possible for the system to continue its operation despite some faults. Note that a fault-tolerant system does not necessarily guarantee high reliability, or vice versa. A definite goal for a fault-tolerant system is that no single fault can cause the system to fail. Hardware redundancy brings penalties with it, such as increased weight, power consumption, size, design time, manufacturing time, and test time. The best way to apply redundancy in a system is determined by weighing these factors. For instance, the weight increase may be reduced by applying redundancy to low-level components.
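The remark that fault tolerance does not by itself guarantee high reliability can be checked with a small calculation. The sketch below assumes N identical, independent modules of reliability `r`, an odd N, and a perfect majority voter; under these assumptions the system works when more than half of the modules work, which is a binomial sum. The function name `nmr_reliability` is illustrative and does not come from the paper.

```python
from math import comb


def nmr_reliability(n: int, r: float) -> float:
    """Probability that a majority of n independent modules (reliability r) work.

    Assumes an odd n and a perfect voter.
    """
    needed = n // 2 + 1  # smallest majority
    return sum(
        comb(n, k) * r**k * (1 - r) ** (n - k)
        for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    for r in (0.4, 0.5, 0.9):
        print(f"r={r}: simplex={r}, TMR={nmr_reliability(3, r)}, "
              f"5MR={nmr_reliability(5, r)}")
```

When the modules themselves are unreliable (r below 0.5), the triplicated system is less reliable than a single module, and at r = 0.5 it brings no gain; redundancy pays off only when the individual modules are already reasonably reliable, and it does so at the cost in weight, power, and size listed above.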

Fig. 7 (disagreement detector, switch, active modules 1 to N, spares 1 to M, voter)

Fig. 8 (module pairs 1a/1b, 2a/2b, 3a/3b, each with a comparator and switch S1 to S3, feeding a voter)

Fig. 9 (modules 1 to n, each with a switch S1 to Sn, feeding a voter)

REFERENCES

[1] H. Fu, M. Cai, L. Fang, P. Liu and J. Dongl, Research on RTOS-Integrated TMR for Fault Tolerant Systems, 8th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, IEEE,
[2] J. Vial, A. Bosio, P. Girard, C. Landrault, S. Pravossoudovitch and A. Virazel, Using TMR Architectures for Yield Improvement, International Symposium on Defect and Fault Tolerance of VLSI Systems, IEEE Computer Society,
[3] M. H. Mottaghi, H. R. Zarandi, "DFTS: A Dynamic Fault-Tolerant Scheduling for Real-Time Tasks in Multicore Processors," Elsevier Journal of Microprocessors & Microsystems, vol. 38, no. 1, pp. ,
[4] M. Murakami, Task-based Dynamic Fault Tolerance for Humanoid Robots, Conference on Systems, Man, and Cybernetics, IEEE, Taipei, Taiwan, October, 2006.
[5] H. Aliee, H. R. Zarandi, "A Fast and Accurate Fault Tree Analysis Based on Stochastic Logic Implemented on Field-Programmable Gate Arrays," IEEE Transactions on Reliability, vol. 61, no. 4, pp. ,
[6] Ebrahimi, M. Mohammadi, A. Ejlali, A., S., Miremadi, "A fast, flexible, and easy-to-develop FPGA-based fault injection technique," Elsevier Journal of Microelectronics Reliability, No. 54, pp. ,
[7] Ghaderi, Z., Miremadi, S. G., Asadi, H., Fazeli, M., "HAFTA: Highly Available Fault-Tolerant Architecture to Protect SRAM-Based Reconfigurable Devices Against Multiple Bit Upsets," IEEE Transactions on Device and Materials Reliability (TDMR), Vol. 13, No. 1, pp. , March

[8] Sengupta, A., Bhadauria, S., Bacterial Foraging Driven Exploration of Multi Cycle Fault Tolerant Datapath based on Power-Performance Tradeoff in High Level Synthesis, Elsevier Journal on Expert Systems With Applications,
[9] Eghbal, H. Pedram, P. Yaghini, H. R. Zarandi, "Designing a Fault-tolerant NoC Router Architecture Respecting Fault Effects," International Journal of Electronics, Taylor & Francis on Network-on-Chip, vol. 97, no. 10, pp. ,
[10] Ebrahimi, M., Miremadi, S. G., Asadi, H., Fazeli, M., "A Low Cost Scan Chain-Based Technique to Recover Multiple Errors in TMR Systems," IEEE Transactions on Very Large Scale Integration Systems (TVLSI), Vol. 21, No. 8, pp. , August


More information

Enabling Testability of Fault-Tolerant Circuits by Means of IDDQ-Checkable Voters

Enabling Testability of Fault-Tolerant Circuits by Means of IDDQ-Checkable Voters Enabling Testability of Fault-Tolerant Circuits by Means of IDDQ-Checkable Voters ECE 7502 Class Discussion Ningxi Liu 14 th Apr 2015 ECE 7502 S2015 Customer Validate Requirements Verify Specification

More information

A Low-Cost Correction Algorithm for Transient Data Errors

A Low-Cost Correction Algorithm for Transient Data Errors A Low-Cost Correction Algorithm for Transient Data Errors Aiguo Li, Bingrong Hong School of Computer Science and Technology Harbin Institute of Technology, Harbin 150001, China liaiguo@hit.edu.cn Introduction

More information

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Authors: Robert L Akamine, Robert F. Hodson, Brock J. LaMeres, and Robert E. Ray www.nasa.gov Contents Introduction to the

More information

CDA 5140 Software Fault-tolerance. - however, reliability of the overall system is actually a product of the hardware, software, and human reliability

CDA 5140 Software Fault-tolerance. - however, reliability of the overall system is actually a product of the hardware, software, and human reliability CDA 5140 Software Fault-tolerance - so far have looked at reliability as hardware reliability - however, reliability of the overall system is actually a product of the hardware, software, and human reliability

More information

Overview ECE 753: FAULT-TOLERANT COMPUTING 1/21/2014. Recap. Fault Modeling. Fault Modeling (contd.) Fault Modeling (contd.)

Overview ECE 753: FAULT-TOLERANT COMPUTING 1/21/2014. Recap. Fault Modeling. Fault Modeling (contd.) Fault Modeling (contd.) ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Fault Modeling Lectures Set 2 Overview Fault Modeling References Fault models at different levels (HW)

More information

Diagnosis in the Time-Triggered Architecture

Diagnosis in the Time-Triggered Architecture TU Wien 1 Diagnosis in the Time-Triggered Architecture H. Kopetz June 2010 Embedded Systems 2 An Embedded System is a Cyber-Physical System (CPS) that consists of two subsystems: A physical subsystem the

More information

EDAC FOR MEMORY PROTECTION IN ARM PROCESSOR

EDAC FOR MEMORY PROTECTION IN ARM PROCESSOR EDAC FOR MEMORY PROTECTION IN ARM PROCESSOR Mrs. A. Ruhan Bevi ECE department, SRM, Chennai, India. Abstract: The ARM processor core is a key component of many successful 32-bit embedded systems. Embedded

More information

SAN FRANCISCO, CA, USA. Ediz Cetin & Oliver Diessel University of New South Wales

SAN FRANCISCO, CA, USA. Ediz Cetin & Oliver Diessel University of New South Wales SAN FRANCISCO, CA, USA Ediz Cetin & Oliver Diessel University of New South Wales Motivation & Background Objectives & Approach Our technique Results so far Work in progress CHANGE 2012 San Francisco, CA,

More information

A novel probabilistic bit voter using genetic algorithm for faulttolerant

A novel probabilistic bit voter using genetic algorithm for faulttolerant www.ijsi.org 88 A novel probabilistic bit voter using genetic algorithm for faulttolerant systems Manizheh Mirsaeidi 1, Abbas Karimi 2 1 Department of omputer Engineering, Faculty of Engineering, Arak

More information

A Hybrid Fault-Tolerant Architecture for Highly Reliable Processing Cores

A Hybrid Fault-Tolerant Architecture for Highly Reliable Processing Cores J Electron Test (2016) 32:147 161 DOI 10.1007/s10836-016-5578-0 A Hybrid Fault-Tolerant Architecture for Highly Reliable Processing Cores I. Wali 1 Arnaud Virazel 1 A. Bosio 1 P. Girard 1 S. Pravossoudovitch

More information

An Integrated ECC and BISR Scheme for Error Correction in Memory

An Integrated ECC and BISR Scheme for Error Correction in Memory An Integrated ECC and BISR Scheme for Error Correction in Memory Shabana P B 1, Anu C Kunjachan 2, Swetha Krishnan 3 1 PG Student [VLSI], Dept. of ECE, Viswajyothy College Of Engineering & Technology,

More information

A Low Area Overhead Fault Tolerant Strategy for Multiple Stuck-At-Faults in Digital Circuits

A Low Area Overhead Fault Tolerant Strategy for Multiple Stuck-At-Faults in Digital Circuits A Low Area Overhead Fault Tolerant Strategy for Multiple Stuck-At-Faults in Digital Circuits John Kalloor * Research Scholar, Department of Electrical and Electronics Engineering, Annamalai University,

More information

Improving FPGA Design Robustness with Partial TMR

Improving FPGA Design Robustness with Partial TMR Improving FPGA Design Robustness with Partial TMR Brian Pratt, Michael Caffrey, Paul Graham, Keith Morgan, Michael Wirthlin Abstract This paper describes an efficient approach of applying mitigation to

More information

Fault-tolerant techniques

Fault-tolerant techniques What are the effects if the hardware or software is not fault-free in a real-time system? What causes component faults? Specification or design faults: Incomplete or erroneous models Lack of techniques

More information

Network Survivability

Network Survivability Network Survivability Bernard Cousin Outline Introduction to Network Survivability Types of Network Failures Reliability Requirements and Schemes Principles of Network Recovery Performance of Recovery

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance 1 Fault Tolerance Basic Concepts Being fault tolerant is strongly related to

More information

A CAN-Based Architecture for Highly Reliable Communication Systems

A CAN-Based Architecture for Highly Reliable Communication Systems A CAN-Based Architecture for Highly Reliable Communication Systems H. Hilmer Prof. Dr.-Ing. H.-D. Kochs Gerhard-Mercator-Universität Duisburg, Germany E. Dittmar ABB Network Control and Protection, Ladenburg,

More information

Fault-tolerant design techniques. slides made with the collaboration of: Laprie, Kanoon, Romano

Fault-tolerant design techniques. slides made with the collaboration of: Laprie, Kanoon, Romano Fault-tolerant design techniques slides made with the collaboration of: Laprie, Kanoon, Romano Fault Tolerance Key Ingredients Error Processing ERROR PROCESSING Error detection: identification of erroneous

More information

New Active Caching Method to Guarantee Desired Communication Reliability in Wireless Sensor Networks

New Active Caching Method to Guarantee Desired Communication Reliability in Wireless Sensor Networks J. Basic. Appl. Sci. Res., 2(5)4880-4885, 2012 2012, TextRoad Publication ISSN 2090-4304 Journal of Basic and Applied Scientific Research www.textroad.com New Active Caching Method to Guarantee Desired

More information

A Robust Bloom Filter

A Robust Bloom Filter A Robust Bloom Filter Yoon-Hwa Choi Department of Computer Engineering, Hongik University, Seoul, Korea. Orcid: 0000-0003-4585-2875 Abstract A Bloom filter is a space-efficient randomized data structure

More information

Single Event Upset Mitigation Techniques for SRAM-based FPGAs

Single Event Upset Mitigation Techniques for SRAM-based FPGAs Single Event Upset Mitigation Techniques for SRAM-based FPGAs Fernanda de Lima, Luigi Carro, Ricardo Reis Universidade Federal do Rio Grande do Sul PPGC - Instituto de Informática - DELET Caixa Postal

More information

Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki

Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki Introduction to Software Fault Tolerance Techniques and Implementation Presented By : Hoda Banki 1 Contents : Introduction Types of faults Dependability concept classification Error recovery Types of redundancy

More information

Research Article Dynamic Reconfigurable Computing: The Alternative to Homogeneous Multicores under Massive Defect Rates

Research Article Dynamic Reconfigurable Computing: The Alternative to Homogeneous Multicores under Massive Defect Rates International Journal of Reconfigurable Computing Volume 2, Article ID 452589, 7 pages doi:.55/2/452589 Research Article Dynamic Reconfigurable Computing: The Alternative to Homogeneous Multicores under

More information

Generic Scrubbing-based Architecture for Custom Error Correction Algorithms

Generic Scrubbing-based Architecture for Custom Error Correction Algorithms Generic Scrubbing-based Architecture for Custom Error Correction Algorithms Rui Santos, Shyamsundar Venkataraman Department of Electrical & Computer Engineering National University of Singapore Email:

More information

DESIGN AND ANALYSIS OF TRANSIENT FAULT TOLERANCE FOR MULTI CORE ARCHITECTURE

DESIGN AND ANALYSIS OF TRANSIENT FAULT TOLERANCE FOR MULTI CORE ARCHITECTURE DESIGN AND ANALYSIS OF TRANSIENT FAULT TOLERANCE FOR MULTI CORE ARCHITECTURE DivyaRani 1 1pg scholar, ECE Department, SNS college of technology, Tamil Nadu, India -----------------------------------------------------------------------------------------------------------------------------------------------

More information

Evaluation of FPGA Resources for Built-In Self-Test of Programmable Logic Blocks

Evaluation of FPGA Resources for Built-In Self-Test of Programmable Logic Blocks Evaluation of FPGA Resources for Built-In Self-Test of Programmable Logic Blocks Charles Stroud, Ping Chen, Srinivasa Konala, Dept. of Electrical Engineering University of Kentucky and Miron Abramovici

More information

Czech Technical University in Prague Faculty of Electrical Engineering. Doctoral Thesis

Czech Technical University in Prague Faculty of Electrical Engineering. Doctoral Thesis Czech Technical University in Prague Faculty of Electrical Engineering Doctoral Thesis March 2007 Pavel Kubalík Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer

More information

Built-in Self-Test and Repair (BISTR) Techniques for Embedded RAMs

Built-in Self-Test and Repair (BISTR) Techniques for Embedded RAMs Built-in Self-Test and Repair (BISTR) Techniques for Embedded RAMs Shyue-Kung Lu and Shih-Chang Huang Department of Electronic Engineering Fu Jen Catholic University Hsinchuang, Taipei, Taiwan 242, R.O.C.

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance

More information

Design and Implementation of Fault Tolerant Adders on Field Programmable Gate Arrays

Design and Implementation of Fault Tolerant Adders on Field Programmable Gate Arrays University of Texas at Tyler Scholar Works at UT Tyler Electrical Engineering Theses Electrical Engineering Spring 4-27-2012 Design and Implementation of Fault Tolerant Adders on Field Programmable Gate

More information

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability. Copyright 2010 Daniel J. Sorin Duke University

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability. Copyright 2010 Daniel J. Sorin Duke University Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2010 Daniel J. Sorin Duke University Definition and Motivation Outline General Principles of Available System Design

More information

Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers

Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers Subash Chandar G (g-chandar1@ti.com), Vaideeswaran S (vaidee@ti.com) DSP Design, Texas Instruments India

More information

Error Resilience in Digital Integrated Circuits

Error Resilience in Digital Integrated Circuits Error Resilience in Digital Integrated Circuits Heinrich T. Vierhaus BTU Cottbus-Senftenberg Outline 1. Introduction 2. Faults and errors in nano-electronic circuits 3. Classical fault tolerant computing

More information

Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan

Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan White paper Version: 1.1 Updated: Oct., 2017 Abstract: This white paper introduces Infortrend Intelligent

More information

2oo4D: A New Design Concept for Next-Generation Safety Instrumented Systems 07/2000

2oo4D: A New Design Concept for Next-Generation Safety Instrumented Systems 07/2000 2oo4D: A New Design Concept for Next-Generation Safety Instrumented Systems 07/2000 Copyright, Notices and Trademarks 2000 Honeywell Safety Management Systems B.V. Revision 01 July 2000 While this information

More information

Responsive Roll-Forward Recovery in Embedded Real-Time Systems

Responsive Roll-Forward Recovery in Embedded Real-Time Systems Responsive Roll-Forward Recovery in Embedded Real-Time Systems Jie Xu and Brian Randell Department of Computing Science University of Newcastle upon Tyne, Newcastle upon Tyne, UK ABSTRACT Roll-forward

More information

AUTONOMOUS RECONFIGURATION OF IP CORE UNITS USING BLRB ALGORITHM

AUTONOMOUS RECONFIGURATION OF IP CORE UNITS USING BLRB ALGORITHM AUTONOMOUS RECONFIGURATION OF IP CORE UNITS USING BLRB ALGORITHM B.HARIKRISHNA 1, DR.S.RAVI 2 1 Sathyabama Univeristy, Chennai, India 2 Department of Electronics Engineering, Dr. M. G. R. Univeristy, Chennai,

More information

Basic Concepts of Reliability

Basic Concepts of Reliability Basic Concepts of Reliability Reliability is a broad concept. It is applied whenever we expect something to behave in a certain way. Reliability is one of the metrics that are used to measure quality.

More information

HP Advanced Memory Protection technologies

HP Advanced Memory Protection technologies HP Advanced Memory Protection technologies technology brief, 5th edition Abstract... 2 Introduction... 2 Memory errors... 2 Single-bit and multi-bit errors... 3 Hard errors and soft errors... 3 Increasing

More information

Intel iapx 432-VLSI building blocks for a fault-tolerant computer

Intel iapx 432-VLSI building blocks for a fault-tolerant computer Intel iapx 432-VLSI building blocks for a fault-tolerant computer by DAVE JOHNSON, DAVE BUDDE, DAVE CARSON, and CRAIG PETERSON Intel Corporation Aloha, Oregon ABSTRACT Early in 1983 two new VLSI components

More information

Self-checking combination and sequential networks design

Self-checking combination and sequential networks design Self-checking combination and sequential networks design Tatjana Nikolić Faculty of Electronic Engineering Nis, Serbia Outline Introduction Reliable systems Concurrent error detection Self-checking logic

More information

Exact Parallel Plurality Voting Algorithm for Totally Ordered Object Space Fault-Tolerant Systems

Exact Parallel Plurality Voting Algorithm for Totally Ordered Object Space Fault-Tolerant Systems Pertanika J. Sci. & Technol. 20 (1): 89 96 (2012) ISSN: 0128-7680 Universiti Putra Malaysia Press Exact Parallel Plurality Voting Algorithm for Totally Ordered Object Space Fault-Tolerant Systems Abbas

More information

Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan

Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan White paper Version: 1.1 Updated: Sep., 2017 Abstract: This white paper introduces Infortrend Intelligent

More information

Improving Fault Tolerance of Network-on-Chip Links via Minimal Redundancy and Reconfiguration

Improving Fault Tolerance of Network-on-Chip Links via Minimal Redundancy and Reconfiguration Improving Fault Tolerance of Network-on-Chip Links via Minimal Redundancy and Reconfiguration Hamed S. Kia, and Cristinel Ababei Department of Electrical and Computer Engineering North Dakota State University

More information

Part 2: Basic concepts and terminology

Part 2: Basic concepts and terminology Part 2: Basic concepts and terminology Course: Dependable Computer Systems 2012, Stefan Poledna, All rights reserved part 2, page 1 Def.: Dependability (Verlässlichkeit) is defined as the trustworthiness

More information

Accelerating CDC Verification Closure on Gate-Level Designs

Accelerating CDC Verification Closure on Gate-Level Designs Accelerating CDC Verification Closure on Gate-Level Designs Anwesha Choudhury, Ashish Hari anwesha_choudhary@mentor.com, ashish_hari@mentor.com Design Verification Technologies Mentor Graphics Abstract:

More information

FAULT TOLERANCE FOR DIGITAL SYSTEMS

FAULT TOLERANCE FOR DIGITAL SYSTEMS FAULT TOLERANCE FOR DIGITAL SYSTEMS Abstract Herbert Hecht SoHaR Incorporated Fault tolerance is an essential methodology for digital systems, particularly for those that serve applications where failure

More information

Automation Intelligence Enlighten your automation mind..!

Automation Intelligence Enlighten your automation mind..! Friends, It brings us immense pleasure to introduce Automation intelligence a knowledge series about automation, innovation, solutions & Technology. This series will discuss about PLC, DCS, Drives, SCADA,

More information

Sample Exam. Advanced Test Automation Engineer

Sample Exam. Advanced Test Automation Engineer Sample Exam Advanced Test Automation Engineer Answer Table ASTQB Created - 08 American Stware Testing Qualifications Board Copyright Notice This document may be copied in its entirety, or extracts made,

More information

Algorithms for Efficient Runtime Fault Recovery on Diverse FPGA Architectures

Algorithms for Efficient Runtime Fault Recovery on Diverse FPGA Architectures Algorithms for Efficient Runtime Fault Recovery on Diverse FPGA Architectures John Lach UCLA EE Department jlach@icsl.ucla.edu William H. Mangione-Smith UCLA EE Department billms@ee.ucla.edu Miodrag Potkonjak

More information

Testing for the Unexpected Using PXI

Testing for the Unexpected Using PXI Testing for the Unexpected Using PXI An Automated Method of Injecting Faults for Engine Management Development By Shaun Fuller Pickering Interfaces Ltd. What will happen if a fault occurs in an automotive

More information

HW/SW Co-Detection of Transient and Permanent Faults with Fast Recovery in Statically Scheduled Data Paths

HW/SW Co-Detection of Transient and Permanent Faults with Fast Recovery in Statically Scheduled Data Paths HW/SW Co-Detection of Transient and Permanent Faults with Fast Recovery in Statically Scheduled Data Paths Mario Schölzel Department of Computer Science Brandenburg University of Technology Cottbus, Germany

More information

Fault Simulation. Problem and Motivation

Fault Simulation. Problem and Motivation Fault Simulation Problem and Motivation Fault Simulation Problem: Given A circuit A sequence of test vectors A fault model Determine Fault coverage Fraction (or percentage) of modeled faults detected by

More information

Efficient Majority Logic Fault Detector/Corrector Using Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes

Efficient Majority Logic Fault Detector/Corrector Using Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes Efficient Majority Logic Fault Detector/Corrector Using Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes 1 U.Rahila Begum, 2 V. Padmajothi 1 PG Student, 2 Assistant Professor 1 Department Of

More information

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Distributed Systems. Fault Tolerance. Paul Krzyzanowski Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected

More information

Performance of Constant Addition Using Enhanced Flagged Binary Adder

Performance of Constant Addition Using Enhanced Flagged Binary Adder Performance of Constant Addition Using Enhanced Flagged Binary Adder Sangeetha A UG Student, Department of Electronics and Communication Engineering Bannari Amman Institute of Technology, Sathyamangalam,

More information

Reliability Improvement in Reconfigurable FPGAs

Reliability Improvement in Reconfigurable FPGAs Reliability Improvement in Reconfigurable FPGAs B. Chagun Basha Jeudis de la Comm 22 May 2014 1 Overview # 2 FPGA Fabrics BlockRAM resource Dedicated multipliers I/O Blocks Programmable interconnect Configurable

More information

Risk Management. Modifications by Prof. Dong Xuan and Adam C. Champion. Principles of Information Security, 5th Edition 1

Risk Management. Modifications by Prof. Dong Xuan and Adam C. Champion. Principles of Information Security, 5th Edition 1 Risk Management Modifications by Prof. Dong Xuan and Adam C. Champion Principles of Information Security, 5th Edition 1 Learning Objectives Upon completion of this material, you should be able to: Define

More information

Distributed Systems 11. Consensus. Paul Krzyzanowski

Distributed Systems 11. Consensus. Paul Krzyzanowski Distributed Systems 11. Consensus Paul Krzyzanowski pxk@cs.rutgers.edu 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value must be one

More information

Algorithm for Determining Most Qualified Nodes for Improvement in Testability

Algorithm for Determining Most Qualified Nodes for Improvement in Testability ISSN:2229-6093 Algorithm for Determining Most Qualified Nodes for Improvement in Testability Rupali Aher, Sejal Badgujar, Swarada Deodhar and P.V. Sriniwas Shastry, Department of Electronics and Telecommunication,

More information

Fine-Grain Redundancy Techniques for High- Reliable SRAM FPGA`S in Space Environment: A Brief Survey

Fine-Grain Redundancy Techniques for High- Reliable SRAM FPGA`S in Space Environment: A Brief Survey Fine-Grain Redundancy Techniques for High- Reliable SRAM FPGA`S in Space Environment: A Brief Survey T.Srinivas Reddy 1, J.Santosh 2, J.Prabhakar 3 Assistant Professor, Department of ECE, MREC, Hyderabad,

More information

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson Distributed systems Lecture 6: Elections, distributed transactions, and replication DrRobert N. M. Watson 1 Last time Saw how we can build ordered multicast Messages between processes in a group Need to

More information

Fault Tolerant Computing CS 530

Fault Tolerant Computing CS 530 Fault Tolerant Computing CS 530 Lecture Notes 1 Introduction to the class Yashwant K. Malaiya Colorado State University 1 Instructor, TA Instructor: Yashwant K. Malaiya, Professor malaiya @ cs.colostate.edu

More information

Fault Tolerant Asynchronous Adder through Dynamic Self-reconfiguration

Fault Tolerant Asynchronous Adder through Dynamic Self-reconfiguration Fault Tolerant Asynchronous Adder through Dynamic Self-reconfiguration Song Peng and Rajit Manohar Computer Systems Laboratory Cornell University Ithaca, NY 14853, USA {speng,rajit}@csl.cornell.edu Abstract

More information

Fault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University Fault Tolerance Part I CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Overview Basic concepts Process resilience Reliable client-server communication Reliable group communication Distributed

More information

A Low-Latency DMR Architecture with Efficient Recovering Scheme Exploiting Simultaneously Copiable SRAM

A Low-Latency DMR Architecture with Efficient Recovering Scheme Exploiting Simultaneously Copiable SRAM A Low-Latency DMR Architecture with Efficient Recovering Scheme Exploiting Simultaneously Copiable SRAM Go Matsukawa 1, Yohei Nakata 1, Yuta Kimi 1, Yasuo Sugure 2, Masafumi Shimozawa 3, Shigeru Oho 4,

More information

Fault-Free: A Framework for Supporting Fault Tolerance in FPGAs

Fault-Free: A Framework for Supporting Fault Tolerance in FPGAs Fault-Free: A Framework for Supporting Fault Tolerance in FPGAs Kostas Siozios 1, Dimitrios Soudris 1 and Dionisios Pnevmatikatos 2 1 School of Electrical & Computer Engineering, National Technical University

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

A Low Cost Checker for Matrix Multiplication

A Low Cost Checker for Matrix Multiplication A Low Cost Checker for Matrix Multiplication Lisbôa, C. A., Erigson, M. I., and Carro, L. Instituto de Informática, Universidade Federal do Rio Grande do Sul calisboa@inf.ufrgs.br, mierigson@terra.com.br,

More information

Increasing Reliability of Programmable Mixed-Signal Systems by Applying Design Diversity Redundancy

Increasing Reliability of Programmable Mixed-Signal Systems by Applying Design Diversity Redundancy Increasing Reliability of Programmable Mixed-Signal Systems by Applying Design Diversity Redundancy Gabriel de M. Borges, Luiz F. Gonçalves, Tiago R. Balen, Marcelo S. Lubaszewski Universidade Federal

More information

Distributed Systems Fault Tolerance

Distributed Systems Fault Tolerance Distributed Systems Fault Tolerance [] Fault Tolerance. Basic concepts - terminology. Process resilience groups and failure masking 3. Reliable communication reliable client-server communication reliable

More information

ReSpace/MAPLD Conference Albuquerque, NM, August A Fault-Handling Methodology by Promoting Hardware Configurations via PageRank

ReSpace/MAPLD Conference Albuquerque, NM, August A Fault-Handling Methodology by Promoting Hardware Configurations via PageRank ReSpace/MAPLD Conference Albuquerque, NM, August 2011. A Fault-Handling Methodology by Promoting Hardware Configurations via PageRank Naveed Imran and Ronald F. DeMara Department of Electrical Engineering

More information

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf Distributed systems Lecture 6: distributed transactions, elections, consensus and replication Malte Schwarzkopf Last time Saw how we can build ordered multicast Messages between processes in a group Need

More information

Ultra Depedable VLSI by Collaboration of Formal Verifications and Architectural Technologies

Ultra Depedable VLSI by Collaboration of Formal Verifications and Architectural Technologies Ultra Depedable VLSI by Collaboration of Formal Verifications and Architectural Technologies CREST-DVLSI - Fundamental Technologies for Dependable VLSI Systems - Masahiro Fujita Shuichi Sakai Masahiro

More information

Dual Redundant Flight Control System Design for Microminiature UAV Xiao-Lin ZHANG 1,a, Hai-Sheng Li 2,b, Dan-Dan YUAN 2,c

Dual Redundant Flight Control System Design for Microminiature UAV Xiao-Lin ZHANG 1,a, Hai-Sheng Li 2,b, Dan-Dan YUAN 2,c 2nd International Conference on Electrical, Computer Engineering and Electronics (ICECEE 2015) Dual Redundant Flight Control System Design for Microminiature UAV Xiao-Lin ZHANG 1,a, Hai-Sheng Li 2,b, Dan-Dan

More information