The study of hardware redundancy techniques to provide a fault tolerant system


Cumhuriyet Üniversitesi Fen Fakültesi Fen Bilimleri Dergisi (CFD), Cilt: 36, No: 4 Özel Sayı (2015), ISSN: 1300-1949
Cumhuriyet University Faculty of Science Science Journal (CSJ), Vol. 36, No: 4 Special Issue (2015), ISSN: 1300-1949

Mostafa SADEGHI 1, Hossein SOLTANI 2, Mohamadreza KHAYYAMBASHI 3

1 Department of Computer, Zavareh Branch, Islamic Azad University, Zavareh, Iran
2 Department of Computer, Ashkezar Branch, Islamic Azad University, Ashkezar, Iran
3 Department of Computer Engineering, University of Isfahan, Isfahan, Iran

Received: ; Accepted:

Abstract

Increasing the reliability of computer system operation is feasible by means of fault tolerance. Fault tolerance in a digital system is achieved through redundancy in hardware, software, or computation. This redundancy can be applied in a static, dynamic, or hybrid configuration. Hardware redundancy is obtained by providing two or more physical instances of a hardware component. In this paper, we study different hardware redundancy techniques, their efficiency, and their problems.

Keywords: Fault tolerance, Hardware redundancy, TMR structure, reliability, availability.

INTRODUCTION

A system that can continue to operate correctly in the presence of hardware or software faults is called fault tolerant [1]. Today, computer systems are becoming more complex, most of their parts are not tightly integrated, and many factors contribute to the output. It is therefore vital to design systems that do not fail completely when one of their parts has a problem, but keep operating correctly and still reach the final goal at the cost of only a change in overall efficiency. Digital systems are given increasingly critical tasks and therefore need higher reliability. The usual design techniques and high-quality components alone do not reduce the probability of failure sufficiently.
This means that systems must be fault tolerant. The most important technique used so far to achieve fault tolerance is redundancy. The rest of this paper defines failure, fault, and error; discusses hardware faults and their kinds, fault tolerance, the purposes and applications of fault-tolerant design, the components of fault tolerance strategies, the relation between redundancy and fault tolerance, and hardware redundancy and its techniques; and ends with a conclusion [2].

Failures, Faults and Errors

These three terms have different meanings:

Failure: the inability of a component to perform its predetermined task.

Error: a sign of failure in the system, in which a logic value differs from the expected value. A failure in a system does not necessarily lead to an error. An error occurs when there is a critical failure in the system, in other words when a given input condition produces an incorrect output.

Fault: an abnormal physical condition caused by design errors (such as mistakes in the specification or in configuring the system) or by manufacturing problems. Modeling and protecting against failures caused by design errors and internal factors is tough, because their effects and outcomes are difficult to anticipate [3,4].

* Corresponding authors. E-mails: msadeghi@khuisf.ac.ir, soltaniyazdi@yahoo.com, m.r.khayyambashi@eng.ui.ac.ir
Special Issue: Technological Advances of Engineering Sciences
Faculty of Science, Cumhuriyet University

Specification of fault

A fault can be classified by its duration, nature, and size. The duration of a fault may be transient, intermittent, or permanent. A transient fault is normally the result of an external disturbance; it exists for a limited period and does not recur. Permanent, or hard, faults are conditions of the device that are not corrected by the passage of time. This kind of fault results from component breakdown, physical defects in components, or design errors. A system with an intermittent fault alternates between faulty and correct operation [1,5].

The nature of a fault is determined by its behavior in the system. A logical fault produces errors that can be represented as logic values, whereas the errors resulting from an indeterminate fault have no logic equivalent. The size of a fault is determined by the region it affects. Local faults affect individual components, while global faults affect several components. Because of cost limits, many fault tolerance strategies handle only single faults; multiple failures require expensive failure models and system-wide fault tolerance methods.

Hardware faults

By duration, hardware faults are categorized as permanent, transient, or intermittent [6]. A permanent fault remains active until a corrective action is taken. This kind of fault is normally produced by a physical defect in the hardware. Permanent faults are detected by on-line test methods that run alongside the normal operation of the system [7]. A transient fault is active for a short period of time; if such a fault becomes active repeatedly, it is called an intermittent fault. Because of their short duration, transient faults are often detected through the errors that result from their propagation. Intermittent faults are also commonly called soft faults or glitches [8].
Design philosophy for overcoming errors

Generally speaking, there are three methods to overcome errors and keep the system in its normal condition:

a. Fault avoidance: any technique applied to prevent faults or errors from occurring.
b. Fault masking: any procedure that, after a fault has occurred, prevents it from causing an error in the system.
c. Fault tolerance: the ability of a system to continue operating in spite of faults. It relates to reliability, successful operation, and the absence of breakdowns. A fault-tolerant system must be able to handle faults in hardware or software components, power failures, and other unexpected defects.

The main problem is that as the complexity of a system increases, its reliability decreases unless corrective measures are taken. Another problem is that although designers do their best to remove software errors and hardware faults before the system is put into use, this goal is not achievable in practice, because some environmental factors are unavoidable and some user mistakes are unpredictable. Faults can therefore arise from causes outside the designer's control even when a system has been designed and implemented perfectly [9].

Applications of fault tolerance

Fault tolerance is necessary in many safety-critical, business-critical, and intelligence applications. Safety-critical applications are those in which loss of life or environmental damage must be avoided, such as aircraft control systems, computer-controlled radiotherapy machines, heart-control devices, and military radar. Business-critical applications are those that carry out commercial operations, such as bank transaction systems [1].

Purpose of designing fault tolerance

The objective of fault-tolerant design is to increase reliability by enabling a system to continue operating in spite of some faults. Note that a fault-tolerant system does not necessarily provide high reliability, and high reliability does not necessarily imply fault tolerance. A main goal for a fault-tolerant system might be that no single fault can cause the system to fail [5].

Components of fault tolerance strategy

Fault tolerance in a system is achieved through redundancy in hardware, software, information, or computation. This redundancy can be applied in a static, dynamic, or hybrid configuration. A fault tolerance strategy consists of one or more of the following elements:

Masking: dynamic correction of generated errors.
Detection: detecting an error (the symptom of a fault).
Containment: preventing an error from propagating across defined boundaries.
Diagnosis: identifying the faulty module that is responsible for a detected error.
Repair/Reconfiguration: removing or replacing a faulty component, or providing a mechanism to bypass it.
Recovery: bringing the system from a faulty state back to a state acceptable for continued operation.

To keep a critical system operating correctly when there is no time for off-line fault detection and recovery, a static (passive) configuration is designed to mask as many faults as possible. Dynamic redundancy, on the other hand, works by switching in spare modules or rerouting when a fault occurs. In the hybrid method, some faults are masked by the static configuration while faulty modules are detected and replaced. Hybrid redundancy is desirable for high-reliability applications in which the probability of multiple faults is high.

Fault tolerance and Redundancy

There are different methods to achieve fault tolerance.
The most common approach is to provide a certain amount of redundancy. By definition, redundancy is the provision of functional capabilities that would not be needed in a fault-free environment. There are two kinds of redundancy: space and time. Space redundancy provides additional items, components, or functions that are not necessary for fault-free operation. It is further classified into hardware, software, and information redundancy, depending on the type of redundant resource added to the system. In time redundancy, a computation or data transfer is repeated and the result is compared with a stored copy of the previous result [1].

Hardware redundancy

Hardware redundancy is obtained by providing two or more physical instances of a hardware component. For example, a system may include extra processors, memories, buses, or power supplies. Hardware redundancy is often the only available way to increase the reliability of a system, because other techniques, such as using higher-quality components, have been exhausted or cost more than redundancy [10]. There are three kinds of hardware redundancy: passive, active, and hybrid. Passive redundancy masks faults, while active redundancy detects faults and recovers from them [4].

Passive redundancy

Passive redundancy masks faults by means of voting. It hides faults instead of detecting them. Masking a fault ensures that, despite the fault, only correct data reach the output of the system. One advantage of passive redundancy is that operation continues without interruption, because a fault in a redundant module is masked immediately, unless the number of faulty modules exceeds what the voter can tolerate [4]. The following paragraphs describe some passive redundancy techniques.

a. TMR technique: The triple modular redundancy (TMR) structure is a fault-tolerant architecture based on three identical modules that perform the same task. The modules receive identical or nearly identical input data, and their outputs feed a majority voting circuit [1-10]. Hence, the TMR architecture reduces the probability of an error at the primary output of the system. A faulty module produces a wrong value, which is masked by the two fault-free modules. In the simplest TMR structure, the voter is a weak point: if the voter fails, the whole TMR structure may fail. To avoid this problem, the voter can be protected by more robust software or design techniques [2]. TMR is the most common form of passive redundancy; its basic configuration is shown in Figure 1 [10]. The components are triplicated so that identical computations are performed in parallel, and the voter determines the correct result. If one of the modules fails, the majority voter masks the fault using the results of the two fault-free modules.

Figure 1. TMR technique (inputs 1 to 3, modules 1 to 3, voter)

A TMR system can mask the fault of only one module. A further failure in the remaining modules can cause the voter to produce a wrong result. As long as the other two modules work properly, the TMR system operates correctly [9-10]. Assuming that the voter is perfect and that component failures are mutually independent, the reliability of a TMR system is:

$$R_{TMR} = R_1 R_2 R_3 + (1 - R_1) R_2 R_3 + R_1 (1 - R_2) R_3 + R_1 R_2 (1 - R_3)$$

The term $R_1 R_2 R_3$ is the probability that all three modules work properly. The term $(1 - R_1) R_2 R_3$ is the probability that the first module fails while the second and third work properly. The term $R_1 (1 - R_2) R_3$ is the probability that the first and third modules work correctly but the second fails. The term $R_1 R_2 (1 - R_3)$ is the probability that the first and second modules work correctly while the third fails.
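As a minimal sketch of the TMR reliability formula above, the Python below evaluates the four-term sum directly and cross-checks it by enumerating all eight working/failed combinations of the three modules. It assumes a perfect voter and independent module failures; the function names and the example value R = 0.9 are illustrative choices, not taken from the paper.

```python
from itertools import product


def tmr_reliability(r1: float, r2: float, r3: float) -> float:
    """Four-term TMR formula: all three modules work, or exactly one fails."""
    return (r1 * r2 * r3
            + (1 - r1) * r2 * r3
            + r1 * (1 - r2) * r3
            + r1 * r2 * (1 - r3))


def tmr_reliability_enumerated(r1: float, r2: float, r3: float) -> float:
    """Same quantity, summed over every state with at least two working modules."""
    total = 0.0
    for states in product((True, False), repeat=3):
        if sum(states) >= 2:  # a majority of modules still works
            p = 1.0
            for works, r in zip(states, (r1, r2, r3)):
                p *= r if works else (1 - r)
            total += p
    return total


if __name__ == "__main__":
    r = 0.9  # assumed reliability of each module
    print(tmr_reliability(r, r, r))
    print(tmr_reliability_enumerated(r, r, r))
    print(3 * r**2 - 2 * r**3)  # closed form when R1 = R2 = R3 = R
```

With identical modules the sum collapses to $3R^2 - 2R^3$, which is why the three printed values agree up to floating-point rounding.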
A more accurate estimate of the reliability of a TMR system, which also takes the reliability of the voter into account and assumes that all three modules have the same reliability $R$, is:

$$R_{TMR} = (3R^2 - 2R^3) R_v$$

where $R_v$ is the reliability of the voter. The voter is in series with the redundant modules, because if the voter fails, the whole system fails. For the reliability of a TMR system to be much higher than that of a simple, non-redundant system, the reliability of the voter must be very high. Fortunately, the voter is a simple unit compared with the redundant components, so its probability of failure is much lower [4,10]. Still, in some systems even a single point of failure is unacceptable. A component is called a single point of failure when its failure leads to the failure of the whole system. In such cases, more elaborate voting schemes are used. To avoid depending on a single voter, the voter is triplicated as well. Figure 2 shows this configuration. Decentralized voting removes the single point of failure, but it requires consensus to be established among the three voters [2,4].
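To make the single-voter expression above concrete, the following sketch computes the system reliability for a given module reliability R and voter reliability R_v, and the smallest R_v for which TMR is at least as reliable as one unprotected module. The function names and numeric values are hypothetical; the threshold follows from rearranging $R_v (3R^2 - 2R^3) \ge R$.

```python
def tmr_with_voter(r: float, r_v: float) -> float:
    """Reliability of TMR with identical modules (r) in series with one voter (r_v)."""
    return r_v * (3 * r**2 - 2 * r**3)


def min_voter_reliability(r: float) -> float:
    """Smallest voter reliability for which TMR is no worse than a single module.

    R_v * (3R^2 - 2R^3) >= R  is equivalent to  R_v >= 1 / (3R - 2R^2)  for 0 < R <= 1.
    A result above 1 means no real voter can make TMR pay off at that R.
    """
    return 1.0 / (3 * r - 2 * r**2)


if __name__ == "__main__":
    r = 0.9  # assumed module reliability
    print(min_voter_reliability(r))   # required voter reliability
    print(tmr_with_voter(r, 0.99))    # a good voter: better than one module
    print(tmr_with_voter(r, 0.92))    # a weaker voter: worse than one module
```

For module reliabilities below 0.5 the threshold exceeds 1, so under these assumptions triplication only helps when the individual modules are already more likely to work than to fail.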

Figure 2. TMR with three voters (V1, V2, V3)

Voting can be implemented in hardware or in software. Hardware voting is usually fast enough to meet any timing constraint. If voting is performed by software voters, there may not be enough time available [10].

Figure 3. Logic diagram of a three-input voter (inputs x1, x2, x3; output f)

A three-input majority voter for digital data is shown in Figure 3. The value of the output f is determined by the majority of the values of the inputs x1, x2, and x3. Table 1 defines this voting function.

Table 1. Majority voting function

x1  x2  x3  f
0   0   0   0
0   0   1   0
0   1   0   0
0   1   1   1
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1

In principle, the TMR architecture can tolerate one fault; in practice, however, it may tolerate more than one. If two faults are present, TMR can still operate correctly, depending on the nature and location of the faults. If the resulting errors do not reach the voter on the same output at the same time, the faults are tolerated by the TMR structure [2,10]. For two faults not to be tolerated, they must lie in two different modules and propagate errors to the same output of each module. Figure 4 shows two examples in which the same input pattern is applied to the three modules.

Figure 4. Two faults: (a) tolerable, (b) not tolerable

The voter is omitted from the figure. Each fault is modeled as a stuck-at fault (f1 and f2, respectively). In part (a) of Figure 4, f1 propagates to output O1 of the first module and f2 propagates to output O2 of the second module. For each output, the voter receives two correct values and one incorrect value. The TMR output is therefore correct, and f1 and f2 are tolerated [3]. In part (b) of Figure 4, f1 propagates to both O1 and O2, while f2 propagates to O2. The voter receives one wrong value for O1 and two wrong values for O2. As a result, the value on the second output of the TMR is faulty, so f1 and f2 are not tolerated. We can conclude that two faults located in different modules are tolerated as long as they do not corrupt the same output of each module. When more than one fault occurs, the faults can be analyzed by considering all possible pairs of faults [2,10].

b. NMR technique: N-modular redundancy (NMR) is based on the same principle as TMR, but uses N modules instead of three, as shown in Figure 5.

Figure 5. NMR technique (inputs 1 to n, modules 1 to n, voter)

N is usually chosen to be odd so that majority voting is possible. An NMR system can mask ⌊N/2⌋ module faults, that is, (N - 1)/2 faults for odd N [4,10].

Active Redundancy

Active redundancy works by detecting faults and then taking action to recover from them. There are many fault detection techniques. An effective one is hardware redundancy in the form of duplication with comparison. After a fault is detected, the system should recover quickly and accurately.
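Returning to the majority voting that underlies the TMR and NMR techniques above, the sketch below shows one way a software voter over N module outputs might look. It is only an illustration under simple assumptions: module outputs are hashable values, the voter itself is fault-free, and the names `majority_vote` and `max_masked_faults` are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, or None if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def max_masked_faults(n: int) -> int:
    """Number of faulty modules that an n-module majority voter can mask."""
    return (n - 1) // 2


if __name__ == "__main__":
    # Five modules (N = 5); the fault-free result is assumed to be 42.
    print(majority_vote([42, 42, 42, 7, 9]))  # two faulty modules are masked
    print(majority_vote([42, 42, 7, 7, 7]))   # three modules agree on a wrong value
    print(majority_vote([42, 42, 1, 2, 3]))   # no strict majority at all
    print(max_masked_faults(3), max_masked_faults(5))
```

The second call illustrates the limit of masking: once more than half of the modules fail in the same way, the voter confidently returns the wrong value.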

a. Duplication with comparison: The basic form of active redundancy is duplication with comparison, shown in Figure 6. Two identical modules perform the same computation in parallel, and a comparator compares their results. If the results do not match, an error signal is generated. A duplication with comparison scheme can detect only one fault. After the fault is detected, the system takes no further action to return to an operational state.

Figure 6. Duplication with comparison (inputs 1 and 2, comparator, error signal)

b. Standby sparing technique: Standby sparing is another form of active hardware redundancy. Only one of the N modules is operational and provides the system output; the remaining N-1 modules serve as spares. A spare is a redundant module that is not necessary for normal operation of the system. A switch is a device that monitors the active module; whenever the fault detection (FD) unit reports an error, the switch transfers operation to a spare [4]. There are two types of standby sparing: hot and cold. In hot standby, both the operational module and the spares are powered on, so a spare can take over immediately after the operational module fails. In cold standby, the spares are powered off until they are needed to replace a faulty module. One disadvantage of cold standby sparing is that it takes time to power up the spare, initialize it, and restart the computation. One advantage is that cold spares do not consume power. This is particularly important in applications such as satellite systems, where power consumption is critical.

A standby sparing system with N modules can tolerate N-1 faults. Here, tolerance means that the system detects the faults and then recovers from them successfully, so that it continues to provide correct service.
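The switching behaviour of standby sparing described above can be sketched as a small model. This is a simplified illustration, not the paper's design: the class `StandbySparing` and its method names are hypothetical, fault detection is assumed to be perfect, and spares are assumed to stay healthy until they are switched in.

```python
class StandbySparing:
    """Toy model: one active module, with the remaining modules acting as spares."""

    def __init__(self, n_modules: int):
        self.healthy = [True] * n_modules
        self.active = 0  # index of the operational module

    def report_fault(self) -> bool:
        """Called when the fault detection unit flags the active module.

        Returns True if a spare took over, False if no spare is left.
        """
        self.healthy[self.active] = False
        for idx, ok in enumerate(self.healthy):
            if ok:
                self.active = idx  # the switch selects the next healthy spare
                return True
        return False  # the fault was detected, but recovery is impossible


if __name__ == "__main__":
    system = StandbySparing(3)       # N = 3 modules
    print(system.report_fault())     # first fault: a spare takes over
    print(system.report_fault())     # second fault: the last spare takes over
    print(system.report_fault())     # third fault: detected, no spare left
```

With N = 3 the model recovers from the first two faults and only reports the third, matching the N-1 tolerance figure.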
When the Nth fault occurs, it is still detected, but the system can no longer recover and return to normal operation.

c. Pair-and-a-spare technique: This technique combines duplication with comparison and standby sparing. The idea is similar to standby sparing, except that two operational modules work in parallel instead of one. As in duplication with comparison, their results are compared to detect a disagreement. If the comparator generates an error signal, the switch analyzes the report from the error detection unit and determines which module's output is faulty. The faulty module is removed and replaced by a spare [4]. A pair-and-a-spare system with N modules can tolerate N-1 faults.

d. Watchdog timer: This technique uses a timer for error detection. The absence of an expected event is taken as evidence that a defect has occurred in the system. The timer must be reset periodically. Any defect that prevents the reset from taking place causes the system to shut down, so that no major damage occurs. The basic assumption of this technique is that a healthy system resets the timer periodically.

HYBRID HARDWARE REDUNDANCY

Hybrid approaches combine the attractive features of the passive and active methods described above. This is the most common type of hardware redundancy. However, these methods are very costly and are therefore used only in applications where fault tolerance is really necessary. One of the most important hybrid techniques is NMR with spares, which combines the ideas of NMR and standby sparing. Figure 7 shows this scheme. In each period, the disagreement detector compares the voter output with the output of every individual module. If the output of a module differs from the voter output, the module is considered faulty and is replaced by a spare module [8].

a. Duplex-triplex architecture: This approach, shown in Figure 8, combines duplication with comparison and TMR. TMR makes it possible to mask errors, while the duplicated hardware with its comparator detects errors, so that the faulty module is excluded from the voting process [8,9].

b. Self-purging redundancy: Another hybrid redundancy method is self-purging. In this approach, shown in Figure 9, each module has a switch that compares the voter output with the output of the module. If they disagree, the module is removed from the system. Note that in this structure the voter must be a threshold gate [9].

CONCLUSION

The objective of fault-tolerant design is to improve reliability by enabling the system to keep operating despite some faults. A fault-tolerant system does not necessarily guarantee high reliability, and vice versa. A definite goal for a fault-tolerant system is that no single fault can cause the system to fail. Hardware redundancy brings penalties, such as increased weight, size, power consumption, design time, manufacturing time, and test time. The best way to apply redundancy in a system is determined by weighing these factors. For instance, the weight increase can be limited by applying redundancy to low-level components.

Fig. 7. NMR with spares (disagreement detector, switch, voter, modules 1 to N, spares 1 to M)
Fig. 8. Duplex-triplex architecture (module pairs 1a/1b, 2a/2b, 3a/3b with comparators, switches S1 to S3, voter)
Fig. 9. Self-purging redundancy (modules 1 to n, switches S1 to Sn, voter)

REFERENCES

[1] H. Fu, M. Cai, L. Fang, P. Liu and J. Dongl, "Research on RTOS-Integrated TMR for Fault Tolerant Systems," 8th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, IEEE.
[2] J. Vial, A. Bosio, P. Girard, C. Landrault, S. Pravossoudovitch and A. Virazel, "Using TMR Architectures for Yield Improvement," International Symposium on Defect and Fault Tolerance of VLSI Systems, IEEE Computer Society.
[3] M. H. Mottaghi, H. R. Zarandi, "DFTS: A Dynamic Fault-Tolerant Scheduling for Real-Time Tasks in Multicore Processors," Elsevier Journal of Microprocessors & Microsystems, vol. 38, no. 1.
[4] M. Murakami, "Task-based Dynamic Fault Tolerance for Humanoid Robots," Conference on Systems, Man, and Cybernetics, IEEE, Taipei, Taiwan, October 2006.
[5] H. Aliee, H. R. Zarandi, "A Fast and Accurate Fault Tree Analysis Based on Stochastic Logic Implemented on Field-Programmable Gate Arrays," IEEE Transactions on Reliability, vol. 61, no. 4.
[6] M. Ebrahimi, A. Mohammadi, A. Ejlali, S. Miremadi, "A fast, flexible, and easy-to-develop FPGA-based fault injection technique," Elsevier Journal of Microelectronics Reliability, no. 54.
[7] Z. Ghaderi, S. G. Miremadi, H. Asadi, M. Fazeli, "HAFTA: Highly Available Fault-Tolerant Architecture to Protect SRAM-Based Reconfigurable Devices Against Multiple Bit Upsets," IEEE Transactions on Device and Materials Reliability (TDMR), vol. 13, no. 1, March.
[8] A. Sengupta, S. Bhadauria, "Bacterial Foraging Driven Exploration of Multi Cycle Fault Tolerant Datapath based on Power-Performance Tradeoff in High Level Synthesis," Elsevier Journal on Expert Systems With Applications.
[9] Eghbal, H. Pedram, P. Yaghini, H. R. Zarandi, "Designing a Fault-tolerant NoC Router Architecture Respecting Fault Effects," International Journal of Electronics, Taylor & Francis, on Network-on-Chip, vol. 97, no. 10.
[10] M. Ebrahimi, S. G. Miremadi, H. Asadi, M. Fazeli, "A Low Cost Scan Chain-Based Technique to Recover Multiple Errors in TMR Systems," IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 21, no. 8, August.
