Hardware Safety Integrity. Hardware Safety Design Life-Cycle

Architecture Design and Safety Assessment of Safety Instrumented Systems
Budapest University of Technology and Economics
Department of Measurement and Information Systems

Fundamental Concepts
- Risk Reduction and Risk Reduction Factor (RRF)
- Safety Lifecycle
- Safety Integrity Level (SIL)
- Safety Instrumented System (SIS)
- Safe Failure Fraction (SFF)
- Independence Levels and consequences
- Proof Test
- Interval between two proof tests (T[Proof])
- Probability of Failure on Demand (PFD)
- Failure Rate (λ)
- Failure In Time (FIT)
- Reliability
- Availability
- Mean Time To Failure (MTTF)
- Mean Time Between Failure (MTBF)
- Mean Time To Repair (MTTR)

Safety Instrumented System (SIS) According to IEC 61508 and IEC 61511
[Figure: the SIS comprises a sensor subsystem (sensors and input interface), a logic subsystem, and a final element subsystem (output interface and final elements), connected to the process.]

Basic Notation
- $T_1$: proof-test interval (h)
- MTTR: mean time to restoration (h)
- DC: diagnostic coverage (expressed as a fraction in the equations and as a percentage elsewhere)
- $\beta$: the fraction of undetected failures that have a common cause (expressed as a fraction in the equations and as a percentage elsewhere)
- $\beta_D$: of those failures that are detected by the diagnostic tests, the fraction that have a common cause (expressed as a fraction in the equations and as a percentage elsewhere); the tables assume $\beta = 2\,\beta_D$
- $\lambda$: failure rate (per hour) of a channel in a subsystem

Dangerous and Safe Failure Rates
- $\lambda_D$: dangerous failure rate (per hour) of a channel in a subsystem, equal to $0.5\,\lambda$ (assumes 50% dangerous failures and 50% safe failures)
- $\lambda_{DD}$: detected dangerous failure rate (per hour) of a channel in a subsystem (the sum of all the detected dangerous failure rates within the channel of the subsystem)
- $\lambda_{DU}$: undetected dangerous failure rate (per hour) of a channel in a subsystem (the sum of all the undetected dangerous failure rates within the channel of the subsystem)
- $\lambda_{SD}$: detected safe failure rate (per hour) of a channel in a subsystem (the sum of all the detected safe failure rates within the channel of the subsystem)

Probability of Failure on Demand
- $PFD_{SYS}$: average probability of failure on demand of a safety function for the E/E/PE safety-related system
  o $PFD_G$: average probability of failure on demand for the group of voted channels. If the sensor, logic or final element subsystem comprises only one voted group, then $PFD_G$ is equivalent to $PFD_S$, $PFD_L$ or $PFD_{FE}$ respectively
  o $PFD_S$: average probability of failure on demand for the sensor subsystem
  o $PFD_L$: average probability of failure on demand for the logic subsystem
  o $PFD_{FE}$: average probability of failure on demand for the final element subsystem

Probability of Failure per Hour
- $PFH_{SYS}$: probability of failure per hour of a safety function for the E/E/PE safety-related system
  o $PFH_G$: probability of failure per hour for the group of voted channels. If the sensor, logic or final element subsystem comprises only one voted group, then $PFH_G$ is equivalent to $PFH_S$, $PFH_L$ or $PFH_{FE}$ respectively
  o $PFH_S$: probability of failure per hour for the sensor subsystem
  o $PFH_L$: probability of failure per hour for the logic subsystem
  o $PFH_{FE}$: probability of failure per hour for the final element subsystem

Channel and Group Equivalent Down Time
- $t_{CE}$: channel equivalent mean down time (hour) for 1oo1, 1oo2, 2oo2 and 2oo3 architectures (the combined down time for all the components in the channel of the subsystem)
- $t_{GE}$: voted group equivalent mean down time (hour) for 1oo2 and 2oo3 architectures (the combined down time for all the channels in the voted group)
- $t_{CE}'$: channel equivalent mean down time (hour) for the 1oo2D architecture (the combined down time for all the components in the channel of the subsystem)
- $t_{GE}'$: voted group equivalent mean down time (hour) for the 1oo2D architecture (the combined down time for all the channels in the voted group)

Safety Integrity Level (SIL)
Each SIL corresponds to a band of average probability of failure on demand (PFD avg) and of Risk Reduction Factor (RRF):

SIL   PFD avg              RRF
4     10^-5 to < 10^-4     100000 to 10000
3     10^-4 to < 10^-3     10000 to 1000
2     10^-3 to < 10^-2     1000 to 100
1     10^-2 to < 10^-1     100 to 10

$$PFD_{avg} = \frac{\text{tolerable frequency of the accident}}{\text{frequency of the accident without protection}} = \frac{1}{RRF}$$
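The relationship between PFD avg, RRF and the SIL bands can be checked numerically. The following is a minimal Python sketch, assuming the low-demand bands of the table above; the function names and the example frequencies are illustrative assumptions and do not come from IEC 61508.

```python
# SIL bands from the table above: (SIL, lower PFD bound inclusive, upper bound exclusive)
SIL_BANDS = [
    (4, 1e-5, 1e-4),
    (3, 1e-4, 1e-3),
    (2, 1e-3, 1e-2),
    (1, 1e-2, 1e-1),
]


def required_pfd(tolerable_freq, unprotected_freq):
    """PFD avg = tolerable accident frequency / accident frequency without protection."""
    return tolerable_freq / unprotected_freq


def risk_reduction_factor(pfd_avg):
    """RRF is the reciprocal of PFD avg."""
    return 1.0 / pfd_avg


def sil_from_pfd(pfd_avg):
    """Return the SIL whose band contains pfd_avg, or None if it lies outside SIL 1 to 4."""
    for sil, low, high in SIL_BANDS:
        if low <= pfd_avg < high:
            return sil
    return None


# Hypothetical example: one accident per 10 years without protection,
# one accident per 2000 years tolerated (both in events per year).
pfd = required_pfd(tolerable_freq=1 / 2000, unprotected_freq=1 / 10)
print(pfd, risk_reduction_factor(pfd), sil_from_pfd(pfd))
```

Values that fall exactly on a band boundary are sensitive to floating-point rounding, so a real assessment would treat them explicitly rather than rely on the comparison above.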

SFF (Safe Failure Fraction)
SFF summarizes the fraction of failures that
  o lead to a safe state, or
  o will be detected by a diagnostic measure and lead to a defined safety action

$$SFF = \frac{\sum \lambda_S + \sum \lambda_{DD}}{\sum \lambda_S + \sum \lambda_D}$$

A hardware fault tolerance of N means that N + 1 faults could cause a loss of the safety function.

Safe failure fraction   Hardware fault tolerance
                        0        1        2
< 60%                   SIL 1    SIL 2    SIL 3
60% to < 90%            SIL 2    SIL 3    SIL 4
90% to < 99%            SIL 3    SIL 4    SIL 4
≥ 99%                   SIL 3    SIL 4    SIL 4

Type A and Type B Subsystems
A subsystem can be regarded as type A if
a) the failure modes of all components are well defined; and
b) the behavior of the subsystem under fault conditions can be completely determined; and
c) there is sufficient dependable failure data from field experience to show that the claimed rates of failure for detected and undetected dangerous failures are met

A subsystem shall be regarded as type B if
a) the failure mode of at least one component is not well defined; or
b) the behavior of the subsystem under fault conditions cannot be completely determined; or
c) there is insufficient dependable failure data from field experience to support claims for rates of failure for detected and undetected dangerous failures
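As an illustration of the SFF definition and the table of maximum SIL by hardware fault tolerance, here is a minimal Python sketch. It assumes the table values shown in this section apply (they match the Type A case), and the failure rates in the example are made-up numbers that follow the 50% safe / 50% dangerous split and a 60% diagnostic coverage; the function names are hypothetical.

```python
def safe_failure_fraction(lambda_s, lambda_dd, lambda_du):
    """SFF = (safe + detected dangerous) / (safe + all dangerous)."""
    total = lambda_s + lambda_dd + lambda_du
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return (lambda_s + lambda_dd) / total


# Rows: (exclusive upper SFF bound, max SIL for hardware fault tolerance 0, 1, 2)
MAX_SIL_TABLE = [
    (0.60, (1, 2, 3)),
    (0.90, (2, 3, 4)),
    (0.99, (3, 4, 4)),
    (float("inf"), (3, 4, 4)),
]


def max_sil(sff, hft):
    """Look up the maximum claimable SIL for a given SFF and hardware fault tolerance."""
    if hft not in (0, 1, 2):
        raise ValueError("table only covers hardware fault tolerance 0, 1 or 2")
    for upper, sils in MAX_SIL_TABLE:
        if sff < upper:
            return sils[hft]


# Illustrative channel: lambda = 1e-6 per hour, half safe, half dangerous, DC = 60%
sff = safe_failure_fraction(lambda_s=5e-7, lambda_dd=3e-7, lambda_du=2e-7)
print(sff, max_sil(sff, hft=0), max_sil(sff, hft=1))
```

With these assumed rates the SFF is 80%, which lands in the 60% to < 90% row.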

Type A and Type B Architectural Constraints

Type A (well defined failure modes; and completely determined behavior under fault; and sufficient dependable failure data)

Safe failure fraction   Hardware fault tolerance
                        0        1        2
< 60%                   SIL 1    SIL 2    SIL 3
60% to < 90%            SIL 2    SIL 3    SIL 4
90% to < 99%            SIL 3    SIL 4    SIL 4
≥ 99%                   SIL 3    SIL 4    SIL 4

Type B (at least one component failure mode is not well defined; or not completely determined behavior under fault; or insufficient dependable failure data)

Safe failure fraction   Hardware fault tolerance
                        0             1        2
< 60%                   Not allowed   SIL 1    SIL 2
60% to < 90%            SIL 1         SIL 2    SIL 3
90% to < 99%            SIL 2         SIL 3    SIL 4
≥ 99%                   SIL 3         SIL 4    SIL 4

Example: Multiple Channels of Subsystems
Subsystems implementing the safety function:
- Subsystem 1: Type B, SIL 3
- Subsystem 2: Type A, SIL 2
- Subsystem 3: Type A, SIL 2
- Subsystem 4: Type B, SIL 2
- Subsystem 5: Type B, SIL 1
[Figure: subsystems 1 and 2 form one channel and subsystems 4 and 5 a second channel; both channels feed subsystem 3.]

Example: Multiple Channels of Subsystems
If a safety function is implemented through a single channel, the maximum hardware SIL that can be claimed for the safety function shall be determined by the subsystem that has met the lowest hardware SIL requirements.

Applying this rule to each channel, the architecture reduces to:
- Subsystems 1 and 2 (SIL 3 and SIL 2): SIL 2
- Subsystems 4 and 5 (SIL 2 and SIL 1): SIL 1
- Subsystem 3: Type A, SIL 2

Multiple Channels of Subsystems
In E/E/PE safety-related systems where a safety function is implemented through multiple channels of subsystems, the maximum hardware SIL that can be claimed for the safety function under consideration shall be determined by
a) assessing each subsystem against the requirements of the Type A or Type B SFF table;
b) grouping the subsystems into combinations; and
c) analyzing those combinations to determine the overall hardware safety integrity level
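The grouping procedure can be sketched in a few lines of Python for the five-subsystem example. This is a simplified illustration, not the normative method: it assumes that a series combination is limited to its lowest SIL, and that a redundant (parallel) combination raises the hardware fault tolerance by 1 and therefore the claimable SIL by one step, which is the reasoning the example itself applies. The helper names `series` and `parallel` are hypothetical.

```python
def series(*sils):
    """A single channel is limited by the subsystem with the lowest SIL."""
    return min(sils)


def parallel(*sils, max_sil=4):
    """Redundant channels: take the best channel and add one step (one extra fault tolerated).

    Simplifying assumption for this example only; capped at SIL 4.
    """
    return min(max(sils) + 1, max_sil)


# Hardware SIL of each subsystem in the example
subsystem_sil = {1: 3, 2: 2, 3: 2, 4: 2, 5: 1}

channel_a = series(subsystem_sil[1], subsystem_sil[2])   # subsystems 1 and 2
channel_b = series(subsystem_sil[4], subsystem_sil[5])   # subsystems 4 and 5
redundant = parallel(channel_a, channel_b)               # subsystems 1, 2, 4 and 5
overall = series(redundant, subsystem_sil[3])            # subsystem 3 is in series

print(channel_a, channel_b, redundant, overall)
```

Under these assumptions the redundant pair of channels reaches SIL 3, while the series subsystem 3 caps the safety function at SIL 2.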

Example: Multiple Channels of Subsystems
Architecture reduces to:
- Subsystems 1, 2, 4 and 5: SIL 3
- Subsystem 3: Type A, SIL 2

In the event of a fault occurring in the combination of subsystems 1 and 2, the safety function could be performed by the combination of subsystems 4 and 5. To take account of this effect, the hardware fault tolerance achieved by the combination of subsystems 1 and 2 is increased by 1. Increasing the hardware fault tolerance by 1 has the effect of increasing the hardware safety integrity level by 1 (see SFF Table). Subsystem 3 is in series with this combination, so by the single-channel rule the safety function as a whole is limited to SIL 2.

Common Cause Failures
The failures of a system arise from two causes:
  o random hardware failures
  o systematic failures

Common cause failures result from a single cause, but may affect more than one channel. They
  o may result from a systematic fault (e.g. a design or specification mistake),
  o may result from external stress leading to an early random hardware failure (e.g. excessive temperature due to the failure of a common cooling fan),
  o or may result from a combination of both

They do not necessarily all manifest themselves simultaneously in all channels.

Model and Means to Reduce Probability of CCF
[Figure: two overlapping sets, failures of channel 1 and failures of channel 2; the overlap represents common cause failures affecting both channels.]

Three avenues that reduce the probability of potentially dangerous common cause failures:
a) Reduce the number of random hardware and systematic failures overall
b) Maximize the independence of the channels
c) Reveal non-simultaneous common cause failures while only one, and before a second, channel has been affected, i.e. use diagnostic tests

Using the β-factor to calculate PFD due to CCF
The probability of dangerous common cause failures without self-diagnosis is

$$\lambda_D\,\beta$$

where
  o $\lambda_D$ is the probability of dangerous random hardware failures for each individual channel and
  o $\beta$ is the fraction of single-channel failures that affect all channels

The overall probability of failure due to dangerous CCF with self-diagnosis is

$$\lambda_{DU}\,\beta + \lambda_{DD}\,\beta_D$$

where
  o $\lambda_{DU}$ is the probability of an undetected failure of a single channel
  o $\beta$ is the common cause failure factor for undetectable dangerous faults, which is equal to the overall β-factor in the absence of diagnostic testing
  o $\lambda_{DD}$ is the probability of a detected failure of a single channel
  o $\beta_D$ is the common cause failure factor for detectable dangerous faults. As the rate of diagnostic testing is increased, the value of $\beta_D$ falls increasingly below $\beta$
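To make the effect of diagnostics on the common cause contribution concrete, the two β-factor expressions can be evaluated side by side. The sketch below is illustrative only: the function name is hypothetical, and the numbers (λ = 5 × 10^-6 per hour, 50% dangerous failures, DC = 60%, β = 2%, β_D = 1%) are assumed example values.

```python
def ccf_rate(lambda_du, lambda_dd, beta, beta_d):
    """Dangerous common cause failure rate: lambda_DU * beta + lambda_DD * beta_D."""
    return lambda_du * beta + lambda_dd * beta_d


lambda_d = 0.5 * 5e-6          # dangerous failure rate per channel (per hour)
dc = 0.60                      # diagnostic coverage (assumed)
beta, beta_d = 0.02, 0.01      # assumed common cause factors

# Without self-diagnosis every dangerous failure is undetected: lambda_D * beta
without_diagnostics = ccf_rate(lambda_d, 0.0, beta, beta_d)

# With self-diagnosis the detected share is weighted by the smaller beta_D
with_diagnostics = ccf_rate(lambda_d * (1 - dc), lambda_d * dc, beta, beta_d)

print(without_diagnostics, with_diagnostics)
```

Because β_D is smaller than β, moving failures from the undetected to the detected category lowers the common cause rate in this example.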

SIL Ratings for Combined Subsystems

0.5% Common Cause Failures
Secondary subsystem   Primary subsystem SIL rating
SIL rating            SIL 1    SIL 2    SIL 3
SIL 1                 SIL 1    SIL 2    SIL 3
SIL 2                 SIL 2    SIL 3    SIL 4
SIL 3                 SIL 3    SIL 4    > SIL 4

1% Common Cause Failures
Secondary subsystem   Primary subsystem SIL rating
SIL rating            SIL 1    SIL 2    SIL 3
SIL 1                 SIL 1    SIL 2    SIL 3
SIL 2                 SIL 2    SIL 3    SIL 4
SIL 3                 SIL 3    SIL 4    SIL 4

5% Common Cause Failures
Secondary subsystem   Primary subsystem SIL rating
SIL rating            SIL 1    SIL 2    SIL 3
SIL 1                 SIL 1    SIL 2    SIL 3
SIL 2                 SIL 2    SIL 3    SIL 4
SIL 3                 SIL 3    SIL 4    SIL 4

10% Common Cause Failures
Secondary subsystem   Primary subsystem SIL rating
SIL rating            SIL 1    SIL 2    SIL 3
SIL 1                 SIL 1    SIL 2    SIL 3
SIL 2                 SIL 2    SIL 3    SIL 3
SIL 3                 SIL 3    SIL 3    SIL 3

Avoidance of Systematic Faults

Assumptions
- the hardware failure rates used as inputs to the calculations and tables are for a single channel of the subsystem
- the channels in a voted group all have the same failure rates and diagnostic coverage
- for each safety function, there is perfect proof testing and repair
- the proof test interval is at least an order of magnitude greater than the diagnostic test interval
- for each subsystem there is a single proof test interval and mean time to restoration
- the expected interval between demands is at least an order of magnitude greater than the mean time to restoration

Average Probability of Failure on Demand
The average probability of failure on demand (PFD avg) of a safety function is determined by calculating and combining the PFD avg for all the subsystems which together provide the safety function:

$$PFD_{SYS} = PFD_S + PFD_L + PFD_{FE}$$

where
  o $PFD_{SYS}$ is the average probability of failure on demand of a safety function for the E/E/PE safety-related system;
  o $PFD_S$ is the average probability of failure on demand for the sensor subsystem;
  o $PFD_L$ is the average probability of failure on demand for the logic subsystem; and
  o $PFD_{FE}$ is the average probability of failure on demand for the final element subsystem

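Determining the PFD avg for Each Subsystem
a) Draw the block diagram showing the subsystems
b) For each voted group in the subsystem, select from the relevant table
   o the architecture (e.g. 2oo3)
   o the diagnostic coverage of each channel (e.g. 60%)
   o the failure rate λ (per hour) of each channel (e.g. $5 \times 10^{-6}$)
   o the common cause failure β-factors, $\beta$ and $\beta_D$ (e.g. 2% and 1% respectively)
c) Obtain, from the relevant table, the average probability of failure on demand for the voted group
d) If the safety function depends on more than one voted group of sensors or actuators, the combined average probability of failure on demand is

$$PFD_S = \sum_i PFD_{Gi}; \qquad PFD_{FE} = \sum_j PFD_{Gj}$$

1oo1 Architecture
[Figure: single channel with diagnostics. Reliability block diagram: $\lambda_{DU}$ with down time $t_{c1} = T_1/2 + MTTR$ in series with $\lambda_{DD}$ with down time $t_{c2} = MTTR$, equivalent to $\lambda_D$ with down time $t_{CE}$.]

$$t_{CE} = \frac{\lambda_{DU}}{\lambda_D}\left(\frac{T_1}{2} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_D}\,MTTR$$

$$\lambda_D = \lambda_{DU} + \lambda_{DD}; \qquad \lambda_{DU} = \frac{\lambda}{2}\,(1 - DC); \qquad \lambda_{DD} = \frac{\lambda}{2}\,DC$$

$$PFD = 1 - e^{-\lambda_D t_{CE}} \approx \lambda_D\,t_{CE} \quad \text{since } \lambda_D\,t_{CE} \ll 1$$

$$PFD_G = (\lambda_{DU} + \lambda_{DD})\,t_{CE}$$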
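The 1oo1 equations can be evaluated directly instead of reading the value from a table. The following Python sketch assumes the 50% dangerous / 50% safe split used in this section; the proof-test interval of 8760 h (one year) and the MTTR of 8 h are illustrative assumptions, and the function names are hypothetical.

```python
def split_dangerous_rate(lam, dc):
    """Split a channel failure rate into (lambda_DU, lambda_DD), assuming 50% dangerous failures."""
    lambda_d = lam / 2
    return lambda_d * (1 - dc), lambda_d * dc


def channel_down_time(lambda_du, lambda_dd, t1, mttr):
    """Channel equivalent mean down time t_CE (hours)."""
    lambda_d = lambda_du + lambda_dd
    return (lambda_du / lambda_d) * (t1 / 2 + mttr) + (lambda_dd / lambda_d) * mttr


def pfd_1oo1(lam, dc, t1, mttr):
    """Average PFD of a 1oo1 voted group: (lambda_DU + lambda_DD) * t_CE."""
    lambda_du, lambda_dd = split_dangerous_rate(lam, dc)
    return (lambda_du + lambda_dd) * channel_down_time(lambda_du, lambda_dd, t1, mttr)


# Example inputs: lambda = 5e-6 per hour, DC = 60%, T1 = 8760 h, MTTR = 8 h (assumed)
lambda_du, lambda_dd = split_dangerous_rate(5e-6, 0.60)
t_ce = channel_down_time(lambda_du, lambda_dd, t1=8760, mttr=8)
pfd = pfd_1oo1(5e-6, 0.60, t1=8760, mttr=8)
print(t_ce, pfd)
```

With these inputs the channel equivalent down time is about 1760 h and the PFD is about 4.4 × 10^-3, which is dominated by the undetected failures waiting for the next proof test.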

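1oo2 Architecture
[Figure: two channels with diagnostics in 1oo2 voting. Reliability block diagram: two parallel channels, each with $\lambda_{DU}$, $\lambda_{DD}$ and down time $t_{CE}$, in series with a common cause failure block; the voted group has down time $t_{GE}$.]

Channel equivalent mean down time:

$$t_{CE} = \frac{\lambda_{DU}}{\lambda_D}\left(\frac{T_1}{2} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_D}\,MTTR$$

System equivalent down time:

$$t_{GE} = \frac{\lambda_{DU}}{\lambda_D}\left(\frac{T_1}{3} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_D}\,MTTR$$

Average probability of failure on demand:

$$PFD_G = 2\left((1 - \beta_D)\,\lambda_{DD} + (1 - \beta)\,\lambda_{DU}\right)^2 t_{CE}\,t_{GE} + \beta_D\,\lambda_{DD}\,MTTR + \beta\,\lambda_{DU}\left(\frac{T_1}{2} + MTTR\right)$$

2oo2 Architecture
[Figure: two channels with diagnostics in 2oo2 voting. Reliability block diagram: two channels in series, each with $\lambda_{DU}$, $\lambda_{DD}$ and down time $t_{CE}$.]

Channel equivalent mean down time:

$$t_{CE} = \frac{\lambda_{DU}}{\lambda_D}\left(\frac{T_1}{2} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_D}\,MTTR$$

Average probability of failure on demand:

$$PFD_G = 2\,\lambda_D\,t_{CE}$$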
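A short numerical comparison shows how the voting scheme changes the result. This Python sketch evaluates the 1oo2 and 2oo2 expressions for one assumed set of inputs (λ_DU = 1 × 10^-6 and λ_DD = 1.5 × 10^-6 per hour, β = 2%, β_D = 1%, T1 = 8760 h, MTTR = 8 h); all names and numbers are illustrative.

```python
def mean_down_time(lambda_du, lambda_dd, t1, mttr, divisor):
    """Equivalent mean down time: divisor=2 gives t_CE, divisor=3 gives t_GE."""
    lambda_d = lambda_du + lambda_dd
    return (lambda_du / lambda_d) * (t1 / divisor + mttr) + (lambda_dd / lambda_d) * mttr


def pfd_1oo2(lambda_du, lambda_dd, beta, beta_d, t1, mttr):
    """Average PFD of a 1oo2 voted group, including the common cause terms."""
    t_ce = mean_down_time(lambda_du, lambda_dd, t1, mttr, 2)
    t_ge = mean_down_time(lambda_du, lambda_dd, t1, mttr, 3)
    independent = (1 - beta_d) * lambda_dd + (1 - beta) * lambda_du
    common_cause = beta_d * lambda_dd * mttr + beta * lambda_du * (t1 / 2 + mttr)
    return 2 * independent ** 2 * t_ce * t_ge + common_cause


def pfd_2oo2(lambda_du, lambda_dd, t1, mttr):
    """Average PFD of a 2oo2 voted group: either channel failing defeats the function."""
    t_ce = mean_down_time(lambda_du, lambda_dd, t1, mttr, 2)
    return 2 * (lambda_du + lambda_dd) * t_ce


args = dict(lambda_du=1e-6, lambda_dd=1.5e-6, t1=8760, mttr=8)
p_1oo2 = pfd_1oo2(beta=0.02, beta_d=0.01, **args)
p_2oo2 = pfd_2oo2(**args)
print(p_1oo2, p_2oo2)
```

For these inputs the 1oo2 group comes out near 1.1 × 10^-4, mostly from the common cause term, while the 2oo2 group is near 8.8 × 10^-3, twice the single-channel value.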

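1oo2D Architecture
[Figure: two channels, each with its own diagnostics, in 1oo2D voting. Reliability block diagram: two parallel channels with $\lambda_{DU}$, $\lambda_{DD}$, $\lambda_{SD}$ and down time $t_{CE}'$, in series with a common cause failure block; the voted group has down time $t_{GE}'$.]

$$\lambda_{SD} = \frac{\lambda}{2}\,DC$$

Channel equivalent mean down time:

$$t_{CE}' = \frac{\lambda_{DU}\left(\frac{T_1}{2} + MTTR\right) + (\lambda_{DD} + \lambda_{SD})\,MTTR}{\lambda_{DU} + \lambda_{DD} + \lambda_{SD}}$$

System equivalent down time:

$$t_{GE}' = \frac{T_1}{3} + MTTR$$

Average probability of failure on demand:

$$PFD_G = 2\,(1 - \beta)\,\lambda_{DU}\left((1 - \beta)\,\lambda_{DU} + (1 - \beta_D)\,\lambda_{DD} + \lambda_{SD}\right) t_{CE}'\,t_{GE}' + \beta_D\,\lambda_{DD}\,MTTR + \beta\,\lambda_{DU}\left(\frac{T_1}{2} + MTTR\right)$$

2oo3 Architecture
[Figure: three channels with diagnostics in 2oo3 voting. Reliability block diagram: three channels with $\lambda_{DU}$, $\lambda_{DD}$ and down time $t_{CE}$, voted 2oo3, in series with a common cause failure block; the voted group has down time $t_{GE}$.]

Channel equivalent mean down time:

$$t_{CE} = \frac{\lambda_{DU}}{\lambda_D}\left(\frac{T_1}{2} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_D}\,MTTR$$

System equivalent down time:

$$t_{GE} = \frac{\lambda_{DU}}{\lambda_D}\left(\frac{T_1}{3} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_D}\,MTTR$$

Average probability of failure on demand:

$$PFD_G = 6\left((1 - \beta_D)\,\lambda_{DD} + (1 - \beta)\,\lambda_{DU}\right)^2 t_{CE}\,t_{GE} + \beta_D\,\lambda_{DD}\,MTTR + \beta\,\lambda_{DU}\left(\frac{T_1}{2} + MTTR\right)$$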
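The 1oo2D and 2oo3 expressions can be evaluated in the same way. The Python sketch below is a hedged illustration: it assumes the equations as written in this section, uses made-up inputs (λ_DU = 1 × 10^-6, λ_DD = λ_SD = 1.5 × 10^-6 per hour, β = 2%, β_D = 1%, T1 = 8760 h, MTTR = 8 h), and the function names are hypothetical.

```python
def common_cause_term(lambda_du, lambda_dd, beta, beta_d, t1, mttr):
    """Common cause contribution shared by the 1oo2, 1oo2D and 2oo3 expressions."""
    return beta_d * lambda_dd * mttr + beta * lambda_du * (t1 / 2 + mttr)


def pfd_2oo3(lambda_du, lambda_dd, beta, beta_d, t1, mttr):
    """Average PFD of a 2oo3 voted group."""
    lambda_d = lambda_du + lambda_dd
    t_ce = (lambda_du / lambda_d) * (t1 / 2 + mttr) + (lambda_dd / lambda_d) * mttr
    t_ge = (lambda_du / lambda_d) * (t1 / 3 + mttr) + (lambda_dd / lambda_d) * mttr
    independent = (1 - beta_d) * lambda_dd + (1 - beta) * lambda_du
    return 6 * independent ** 2 * t_ce * t_ge + common_cause_term(
        lambda_du, lambda_dd, beta, beta_d, t1, mttr)


def pfd_1oo2d(lambda_du, lambda_dd, lambda_sd, beta, beta_d, t1, mttr):
    """Average PFD of a 1oo2D voted group (detected safe failures also take a channel down)."""
    total = lambda_du + lambda_dd + lambda_sd
    t_ce = (lambda_du * (t1 / 2 + mttr) + (lambda_dd + lambda_sd) * mttr) / total
    t_ge = t1 / 3 + mttr
    independent = (1 - beta) * lambda_du + (1 - beta_d) * lambda_dd + lambda_sd
    return 2 * (1 - beta) * lambda_du * independent * t_ce * t_ge + common_cause_term(
        lambda_du, lambda_dd, beta, beta_d, t1, mttr)


p_2oo3 = pfd_2oo3(1e-6, 1.5e-6, beta=0.02, beta_d=0.01, t1=8760, mttr=8)
p_1oo2d = pfd_1oo2d(1e-6, 1.5e-6, 1.5e-6, beta=0.02, beta_d=0.01, t1=8760, mttr=8)
print(p_2oo3, p_1oo2d)
```

With these assumed inputs the 2oo3 group evaluates to roughly 1.6 × 10^-4 and the 1oo2D group to roughly 1.1 × 10^-4; in both cases the common cause term is the largest single contribution.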

Recommended Reading on Calculation of PFD
Tieling Zhang, Wei Long and Yoshinobu Sato: Availability of systems with self-diagnostic components applying Markov model to IEC 61508-6. Reliability Engineering & System Safety, Volume 80, Issue 2, May 2003, Pages 133-141. doi:10.1016/s0951-8320(03)00004-8. Received 11 December 2000; accepted 19 December 2002; available online 7 February 2003.

Case Study: Pressure Relief System
Illustration of how IEC 61508 may be applied in a practical case

The Equipment Under Control
- The EUC is a pressure vessel, used in a batch process that has a weekly cycle
  o It is brought, in a controlled manner, to a prescribed pressure using a control loop
  o The perceived hazard is that the control system might fail, subjecting the vessel to overpressure
- The final safeguard is a bursting disc, which discharges to a stack, releasing the contents of the vessel into the atmosphere
  o It is considered to be 100% reliable but its operation is undesirable for environmental and public relations reasons
- An acceptable level of risk is a 10% probability of a release once in the plant's expected life of ten years

[Figure: Pressure Relief System Installation]

Risk Assessment
- An acceptable level of risk is a 10% probability of a release once in the plant's expected life of ten years, or once per $10^6$ hours
- The Equipment Under Control risk (EUC risk) is once per year, or once per $10^4$ hours
- The required average probability of failure on demand (PFD avg) of the safety function is $10^{-2}$
- Risk reduction factor is 100: SIL 2

[Figure: Pressure Relief System Scheme, showing the pressure transmitter, isolators, trip amplifiers and actuator system]
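The arithmetic behind these figures can be reproduced in a few lines. The Python sketch below assumes 8760 hours per year and treats the 10% probability over ten years as a mean rate of 0.01 releases per year; the variable names are illustrative.

```python
HOURS_PER_YEAR = 8760

# Tolerable risk: 10% probability of one release in a ten-year plant life
tolerable_per_year = 0.10 / 10                             # 0.01 releases per year
tolerable_interval_h = HOURS_PER_YEAR / tolerable_per_year  # hours between tolerated releases

# EUC risk: the control system is expected to fail once per year
euc_per_year = 1.0
euc_interval_h = HOURS_PER_YEAR / euc_per_year

# Required average probability of failure on demand and risk reduction factor
pfd_avg = tolerable_per_year / euc_per_year
rrf = 1 / pfd_avg

print(tolerable_interval_h, euc_interval_h, pfd_avg, rrf)
```

The intervals come out at 876 000 h and 8760 h, which the slide rounds to $10^6$ and $10^4$ hours. A PFD avg of exactly $10^{-2}$ sits on the boundary between the SIL 1 and SIL 2 bands, and the case study assigns SIL 2.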

Safety Function Realization
[Figure: block diagram of the safety function: pressure transmitter, isolator, two trip amplifiers, isolator, actuator system]

                                        Pressure       Isolator       Trip           Isolator   Actuator
                                        transmitter                   amplifier                 system
Architecture                            1oo1           1oo1           1oo2           1oo1       1oo1
Undetected dangerous failure rate,
λ_DU (per year)                         1 x 10^-5      1.2 x 10^-4    1.2 x 10^-5    0          1 x 10^-3
Proof test interval                     1 year         1 year         1 year         -          1 week
Probability of failure on demand        0.5 x 10^-5    6 x 10^-5      0.6 x 10^-5    0          1 x 10^-5
Subsystem type                          Type A         Type A         Type B         Type A     Type A

SIL rating: > SIL SIL x SIL 1 SIL 3 SIL 4
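The PFD row of the table is consistent with the simple approximation PFD ≈ λ_DU × T / 2 applied to each element, where T is the proof test interval. The Python sketch below checks this under stated assumptions: the failure rates are the values as read from the table (the 1.2 figures are inferred from the corresponding PFD entries), one week is taken as 1/52 year, and the proof test interval of the second isolator is a placeholder because its failure rate is zero.

```python
WEEK_IN_YEARS = 1 / 52

# (element, undetected dangerous failure rate per year, proof test interval in years)
elements = [
    ("pressure transmitter", 1e-5, 1.0),
    ("isolator 1", 1.2e-4, 1.0),
    ("trip amplifier", 1.2e-5, 1.0),
    ("isolator 2", 0.0, 1.0),          # interval is a placeholder: rate is zero
    ("actuator system", 1e-3, WEEK_IN_YEARS),
]


def pfd_simple(lambda_du, proof_interval):
    """Approximate average PFD of one element: lambda_DU * T / 2."""
    return lambda_du * proof_interval / 2


pfds = {name: pfd_simple(rate, interval) for name, rate, interval in elements}
pfd_total = sum(pfds.values())

for name, value in pfds.items():
    print(f"{name}: {value:.2e}")
print(f"total: {pfd_total:.2e}")
```

Under these assumptions the summed PFD is on the order of 8 × 10^-5, well below the required PFD avg of $10^{-2}$ from the risk assessment. The sketch does not cover the architectural constraints on each subsystem.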