Coded Calculation for Floating Point Values in Safety I&C - Implementation and Experiences

Arndt LINDNER 1, Christian GERST 2, and Andreas MÖLLEKEN 3

1. TÜV Rheinland ISTec-GmbH, Boltzmannstr. 14, Garching, 85748, Germany (arndt.lindner@istec-gmbh.de)
2. TÜV Rheinland ISTec-GmbH, Boltzmannstr. 14, Garching, 85748, Germany (christian.gerst@istec-gmbh.de)
3. TÜV Rheinland ISTec-GmbH, Boltzmannstr. 14, Garching, 85748, Germany (andreas.moelleken@istec-gmbh.de)

Abstract: The paper describes a methodology to detect erroneous floating point calculations in digital safety I&C during run time. The methodology has the potential to detect processor failures as well as memory failures. It is based on the extension of the normally used algebra to the complex number plane. In the complex number plane a set of sub-algebras is defined. Each sub-algebra is characterized by a subset of valid numbers, the decision criteria for the validity of a number, and appropriately modified operations (addition, subtraction, multiplication, division). In case of a failure of the processor or the memory, a calculation in any of the sub-algebras will result in complex numbers that are not elements of the sub-algebra. This is detected by the given criteria. The theoretical background of the methodology was already presented at the NPIC&HMIT conference in San Diego in 2012. This paper presents the extension of the methodology to logical functions and its implementation in a real I&C platform. The results of practical tests are given. These include tests of the calculation overhead and of the detection of typical failures. Additionally, experiences regarding floating point precision are provided.

Keywords: error detection, floating point calculation, complex number, Boolean functions

1 Introduction

Due to the progress in digital I&C, safety I&C for nuclear power plants is also implemented in systems based on complex processors or application specific integrated circuits, mostly Field Programmable Gate Arrays (FPGA). This advanced technology contrasts with the requirement for simple safety I&C in nuclear plants. It also gives cause for serious concern regarding malfunctions of these complicated systems. The problem is approached by designing redundant and diverse systems, which again increases the complexity of the overall system. In particular, if two diverse systems produce different outputs, it is impossible to identify the correct one. Therefore, criteria that decide whether the result of a calculation is correct or not are very helpful.

In the past, procedures have been developed to enable decisions whether a calculation was correct or not [1], [2], [3]. These methods are mostly tied to processor technology, integer values, and assembler languages. An approach to apply these methods to modern I&C systems programmed in high level languages is given in [4]. The algorithms are defined for integer variables, and the modulo operation is applied for data integrity tests. Since the modulo operation is only defined for integers, these methods cannot be applied directly to floating point variables.

An approach to extend the concept, which is well established for integer variables, to floating point variables has been developed and is described in [5]. In the last year two tasks have been performed:
1. formulas for the residual error probability have been established,
2. tests to identify the calculation effort, failure sensitivity, and precision of calculation have been executed.
The results of these tasks are presented in this paper.

2 Basic concepts

The theoretical background of the concepts applied for integer variables is given in [3].
The theory that leads to the extension of the method to floating point variables is given in [5]. The tests have been performed with algorithms that are based on complex numbers. Other algebraic structures can also be used [5]. ISOFIC/ISSNP 2014, Jeju, Korea, August 24~28, 2014 1
For identification of failures (memory or processor failures) the floating point values r are mapped to complex numbers z:

$$z = r e^{i\varphi} = a + bi \qquad (1)$$

Additionally a constant $q = c e^{i\varphi}$, with a suitable c, is added to avoid zero values during calculation:

$$z = r e^{i\varphi} + q = (r + c) e^{i\varphi} \qquad (2)$$

This process is called encoding of the value r. Decoding is done by

$$r = \operatorname{sgn}(\arg(z - q)) \, |z - q| \qquad (3)$$

Operations with the complex numbers z are then modified in such a way that the angle ϕ is preserved in the case of a correct calculation. It is an invariant. A changed angle ϕ indicates a wrong result. All operations must be performed in Cartesian coordinates to avoid direct manipulation of ϕ or r [5]. To avoid failure annihilation (e.g. if both operands have the same failure in ϕ but with different signs), a specific term is added at the end of each function block (FB) [7].

3 Types of data and function blocks

For the implementation of I&C functions different types of data and function blocks are needed. Fig. 1 shows a rough overview.

4 Algorithms

4.1 Floating point calculations

Operations with encoded values are modified to

$$z_1 \oplus z_2 = z_1 + z_2 - q \qquad (4)$$

for the addition of the encoded values $z_1$ and $z_2$, and

$$z_1 \otimes z_2 = \frac{z_1 z_2 - q (z_1 + z_2 - q - s)}{s} \qquad (5)$$

for the multiplication of the encoded values $z_1$ and $z_2$, where $s = e^{i\varphi}$. The complete set of calculation rules is included in [5] and [7].

4.2 Extension to Boolean variables

Usually Boolean data are represented with their own data types. In this case the invariant ϕ is lost. Therefore, in our approach the Boolean data v are also represented by floating point values: true by the encoded 1 ($= 1 \cdot e^{i\varphi} + c e^{i\varphi}$) and false by the encoded 0 ($= 0 \cdot e^{i\varphi} + c e^{i\varphi}$). Negation is given by

$$\neg v = \mathrm{true} \ominus v \qquad (6)$$

and the logical AND is given by

$$v_1 \wedge v_2 = v_1 \otimes v_2 \qquad (7)$$

This approach results in an extensive calculation overhead (see section 7.3). Nevertheless, it benefits from the preservation of the invariant ϕ and avoids any data type conversion.
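
The encoding, decoding, and coded operations of equations (1) to (7) can be illustrated with a minimal Python sketch. It is not the TELEPERM XS implementation. The parameter values (ϕ interpreted as 23 degrees, c = 10, the tolerance `EPS`) and all function names are assumptions chosen for illustration, and the coded subtraction `coded_sub` is derived here by analogy to equation (4) rather than taken from [5] or [7]. The failure annihilation term mentioned above is omitted.

```python
import cmath
import math

# Illustrative parameters (assumptions, not taken from the platform):
PHI = math.radians(23.0)   # invariant angle phi, here interpreted as 23 degrees
S = cmath.exp(1j * PHI)    # s = e^{i*phi}
C = 10.0                   # offset c; r + C must be nonzero for every value r used
Q = C * S                  # q = c * e^{i*phi}
EPS = 1e-9                 # tolerance of the invariant check (double precision)


def encode(r: float) -> complex:
    """Eq. (2): z = r * e^{i*phi} + q."""
    return r * S + Q


def decode(z: complex) -> float:
    """Eq. (3): r = sgn(arg(z - q)) * |z - q|."""
    d = z - Q
    return math.copysign(abs(d), cmath.phase(d))


def is_valid(z: complex) -> bool:
    """Invariant check: after rotating by -phi the value must be real."""
    w = z * S.conjugate()
    return abs(w.imag) <= EPS * abs(w)


def coded_add(z1: complex, z2: complex) -> complex:
    """Eq. (4): z1 (+) z2 = z1 + z2 - q."""
    return z1 + z2 - Q


def coded_sub(z1: complex, z2: complex) -> complex:
    """Assumed analogue of eq. (4) for subtraction: z1 - z2 + q."""
    return z1 - z2 + Q


def coded_mul(z1: complex, z2: complex) -> complex:
    """Eq. (5): (z1*z2 - q*(z1 + z2 - q - s)) / s."""
    return (z1 * z2 - Q * (z1 + z2 - Q - S)) / S


TRUE = encode(1.0)    # Boolean true as encoded 1
FALSE = encode(0.0)   # Boolean false as encoded 0


def coded_not(v: complex) -> complex:
    """Eq. (6): not v = true (-) v."""
    return coded_sub(TRUE, v)


def coded_and(v1: complex, v2: complex) -> complex:
    """Eq. (7): v1 and v2 = v1 (x) v2."""
    return coded_mul(v1, v2)


a, b = encode(2.5), encode(-3.0)
total = coded_add(a, b)        # decodes to a value close to -0.5
product = coded_mul(a, b)      # decodes to a value close to -7.5
print(decode(total), is_valid(total))
print(decode(product), is_valid(product))

# A corrupted real part changes the angle and violates the invariant.
corrupted = complex(a.real * 2.0, a.imag)
print(is_valid(corrupted))
```

Substituting $z_k = r_k s + q$ into equation (5) gives $r_1 r_2 s + q$, i.e. the encoded product, which is what the sketch relies on. Note that with c = 10 an operand or result of r = -10 would encode to zero, which is the case the constant c is meant to avoid.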

5 Implementation

The algorithms have been implemented in function blocks of the safety I&C platform TELEPERM XS [6]. In the prototype implementation each encoded signal is handled as two signals of the original platform: one signal represents the real part and the other one the imaginary part of the encoded value.

Fig. 1 Single precision calculation

The arrows in fig. 1 represent input and output relations of function blocks and data. The figure shows the central position of Boolean data, which are the output of threshold function blocks.

Fig. 2 Example of the FB CodAdd

This approach has the advantage that all software tools of the platform could be used for the implementation of test cases and for test execution. For the tests a representative set of function blocks was implemented. It includes encoding and decoding of data, arithmetic operations, logical functions, threshold FBs, min/max FBs, and voting FBs. The status values of the signal records [6] were not implemented.

6 Residual failure probability

Since floating point values, in contrast to integers, are not exactly representable, the invariant ϕ must always be treated as an interval (ϕ-ε, ϕ+ε) with a small ε > 0. All values inside the green wedge in fig. 3 are accepted as correct ones. Thus, there is a certain residual failure probability $P_E$ that a result is wrong but indicated as correct (see fig. 3).

Fig. 3 Residual failure probability $P_E$

The residual failure probability can be calculated as a ratio of areas. The ratio depends on ϕ and ε [7]. Since ε is small, the expression for the residual failure probability can be simplified to

$$\frac{\sin 2\varepsilon}{2 - \sin 2\varepsilon} < P_E < \frac{\sin 2\varepsilon}{1 - \sin 2\varepsilon} \qquad (8)$$

The expression does not include ϕ. For ε = 0.0001 the residual failure probability is less than $1.75 \cdot 10^{-6}$.

7 Tests

7.1 Selection of test cases

The algorithms have been tested to study:
- feasibility,
- timing,
- detection of bit errors,
- precision of calculation.

The first two issues were tested directly using a TELEPERM XS rack. The third issue was tested in a simulation environment (PC).

7.2 Feasibility test

For this test a function diagram was implemented that was based on real I&C functions. The function diagram consists of four parts:
1. data acquisition and encoding,
2. calculation path 1,
3. calculation path 2,
4. decoding and failure test.

The two calculation paths were implemented in order to exercise a larger variety of function blocks, not to compare their results. The function diagram is shown in fig. 4 to 7.

Fig. 4 Data acquisition and encoding

Fig. 5 Calculation path 1

Fig. 6 Calculation path 2

Calculation paths 1 and 2 differ in the position of the threshold FB (CodGt). The outputs of the diagrams shown in fig. 5 and 6 are Boolean values.

Fig. 6 Decoding and failure test

After formal specification of the test function on the basis of function diagrams, the software was generated by the code generators of the TELEPERM XS platform. Then the code was loaded into the TELEPERM XS test system. The result of the test runs was that all functions were performed correctly. The calculation time of the complete function diagram was 6.64 ms. This example demonstrates the feasibility of the algorithms.

7.3 Timing

The implemented algorithms were tested for timing, especially for the calculation overhead. The calculation time of selected FBs and the overhead compared with the original function blocks of the platform are shown in table 1.

Table 1 Timing of selected FBs

FB                                calculation time [ms]   calculation overhead factor
CodAdd (addition)                 0.0790                  12.9
CodDiv (division)                 0.0969                  14.7
CodOr (logical OR)                0.1661                  26.4
Cod2oo4                           0.7716                  83.0
CodMax4 (max. of 4 inputs)        1.4759                  242.0
Cod2Max4 (2nd max of 4 inputs)    4.7963                  234.0

The table shows a large overhead factor for logical FBs and a huge overhead factor for the max/min function blocks. It also shows that most of the calculation time of the function described in section 7.2 results from the FB Cod2Max4.

7.4 Detection of bit errors

Tests of bit error detection were performed for stuck-at-1, stuck-at-0, and alternating bits. The results for stuck-at-1 and stuck-at-0 bit errors are given in tables 2 to 4.

Fig. 7 Representation of floating point values (32 bit)

Fig. 7 shows the structure of the single precision floating point variables as used in the prototype implementation. It follows the standard IEEE 754.
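
The single precision layout and the stuck-at bit errors can be reproduced on a PC with a short Python sketch. It is an illustration only and not the simulation environment used for the tests. The function names are hypothetical, and the bit numbering (bit 0 as least significant fraction bit, bits 23 to 30 as biased exponent, bit 31 as sign) is the usual IEEE 754 convention, assumed here to match fig. 7.

```python
import struct


def float32_to_bits(x: float) -> int:
    """Return the 32 bit pattern of x after rounding to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float32(bits: int) -> float:
    """Interpret a 32 bit pattern as a single precision value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def fields(x: float) -> tuple:
    """Split x into (sign, biased exponent, fraction) as laid out in IEEE 754."""
    bits = float32_to_bits(x)
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def stuck_at(x: float, bit: int, level: int) -> float:
    """Force one bit of the single precision pattern of x to 0 or 1."""
    bits = float32_to_bits(x)
    mask = 1 << bit
    bits = (bits | mask) if level else (bits & ~mask)
    return bits_to_float32(bits)


def relative_error(correct: float, faulty: float) -> float:
    """Relative deviation of the faulty value from the correct one."""
    return abs(faulty - correct) / abs(correct)


print(fields(1.0))              # sign, biased exponent, fraction of 1.0
print(stuck_at(1.0, 22, 1))     # most significant fraction bit forced to 1
print(stuck_at(1.0, 0, 0))      # bit already 0: the value is unchanged
```

A stuck-at error has no effect when the bit already has the forced level, which is one reason why a table of stuck-at results contains entries with a relative error of 0.00% that cannot be detected and do not need to be.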
Each of the tables shows the bit number, the relative error of the result due to the bit error, and whether the bit error was detected by a violation of the invariant (the angle of the result deviates from ϕ by more than ε). The tests were performed with 32 bit floating point variables, ϕ = 23 and ε = 0.001. A smaller value for ε could not be used due to the limited precision of 32 bit floating point values (see also section 7.5).

Table 2 Failure detection stuck-at: real part

Bit   Stuck-at-1                Stuck-at-0
      rel. error    detected    rel. error    detected
0     0.00%         No          0.00%         No
1     0.00%         No          0.01%         No
2     0.02%         No          0.00%         No
3     0.00%         No          0.03%         No
4     0.07%         No          0.00%         No
5     0.15%         No          0.00%         No
6     0.29%         No          0.00%         No
7     0.58%         No          0.00%         No
8     0.00%         No          1.16%         No
9     2.34%         No          0.00%         No
10    4.68%         Yes         0.00%         No
11    0.00%         No          9.23%         Yes
12    18.91%        Yes         0.00%         No
13    0.00%         No          35.31%        Yes
14    0.00%         No          59.39%        Yes
15    >100.00%      Yes         0.00%         No
16    >100.00%      Yes         0.00%         No
17    0.00%         No          >100.00%      Yes
18    0.00%         No          >100.00%      Yes
19    >100.00%      Yes         0.00%         No
20    >100.00%      Yes         0.00%         No
21    0.00%         No          >100.00%      Yes
22    0.00%         No          >100.00%      Yes
23    >100.00%      Yes         0.00%         No
24    >100.00%      Yes         0.00%         No
25    >100.00%      Yes         0.00%         No
26    0.00%         No          >100.00%      Yes
27    >100.00%      Yes         0.00%         No
28    >100.00%      Yes         0.00%         No
29    >100.00%      Yes         0.00%         No
30    0.00%         No          >100.00%      Yes
31    >100.00%      Yes         0.00%         No

Table 3 Failure detection stuck-at: imaginary part

Bit   Stuck-at-1                Stuck-at-0
      rel. error    detected    rel. error    detected
0     0.00%         No          0.00%         No
1     0.00%         No          0.00%         No
2     0.00%         No          0.00%         No
3     0.01%         No          0.00%         No
4     0.00%         No          0.01%         No
5     0.03%         No          0.00%         No
6     0.00%         No          0.06%         No
7     0.13%         No          0.00%         No
8     0.00%         No          0.24%         No
9     0.00%         No          0.49%         No
10    1.02%         Yes         0.00%         No
11    0.00%         No          1.87%         Yes
12    0.00%         No          3.50%         Yes
13    9.51%         Yes         0.00%         No
14    21.68%        Yes         0.00%         No
15    0.00%         No          >100.00%      Yes
16    0.00%         No          >100.00%      Yes
17    0.00%         No          >100.00%      Yes
18    >100.00%      Yes         0.00%         No
19    >100.00%      Yes         0.00%         No
20    >100.00%      Yes         0.00%         No
21    >100.00%      Yes         0.00%         No
22    0.00%         No          >100.00%      Yes
23    0.00%         No          >100.00%      Yes
24    0.00%         No          >100.00%      Yes
25    0.00%         No          >100.00%      Yes
26    >100.00%      Yes         0.00%         No
27    >100.00%      Yes         0.00%         No
28    >100.00%      Yes         0.00%         No
29    >100.00%      Yes         0.00%         No
30    0.00%         No          >100.00%      Yes
31    >100.00%      Yes         0.00%         No

The best cases of bit error detection are marked in light green color and the worst cases in light red color. These issues depend on the precision of 32 bit floating point values (see section 7.5) and the magnitude of ε. Table 2 contains the data for bit errors in the real part of the variables, table 3 for bit errors in the imaginary part of the variables, and table 4 for bit errors in both the real and the imaginary part of the variables.

Table 4 Failure detection stuck-at: real and imaginary part

Bit   Stuck-at-1                Stuck-at-0
      rel. error    detected    rel. error    detected
0     0.00%         No          0.00%         No
1     0.00%         No          0.01%         No
2     0.02%         No          0.00%         No
3     0.01%         No          0.03%         No
4     0.07%         No          0.01%         No
5     0.18%         No          0.00%         No
6     0.29%         No          0.06%         No
7     0.71%         No          0.00%         No
8     0.00%         No          1.41%         No
9     2.34%         Yes         0.49%         Yes
10    5.65%         No          0.00%         No
11    0.00%         No          11.29%        No
12    18.91%        Yes         3.50%         Yes
13    9.51%         Yes         35.31%        Yes
14    21.68%        Yes         59.39%        Yes
15    >100.00%      Yes         >100.00%      Yes
16    >100.00%      Yes         >100.00%      Yes
17    0.00%         No          >100.00%      Yes
18    >100.00%      Yes         >100.00%      Yes
19    >100.00%      Yes         0.00%         No
20    >100.00%      Yes         0.00%         No
21    >100.00%      Yes         >100.00%      Yes
22    0.00%         No          >100.00%      Yes
23    >100.00%      Yes         >100.00%      Yes
24    >100.00%      Yes         >100.00%      Yes
25    >100.00%      Yes         >100.00%      Yes
26    >100.00%      Yes         >100.00%      Yes
27    >100.00%      No          0.00%         No
28    >100.00%      No          0.00%         No
29    >100.00%      No          0.00%         No
30    0.00%         No          >100.00%      No
31    >100.00%      Yes         0.00%         No

7.5 Precision of calculation

All tests mentioned above were executed on the TELEPERM XS platform. It uses single precision values. The failure to detect bit errors in bits 7 to 11 results from the limited precision of 32 bit floating point values. Fig. 8 shows calculations with 32 bit floating point values executed on a personal computer (PC). The calculation was done using the same libraries as on the TELEPERM XS platform. These calculations demonstrate that ε has to be of the order of 0.001 to avoid false error detection due to rounding errors.

Fig. 8 Single precision calculation

Fig. 9 shows calculations with double precision values (64 bit). Rounding errors could not be observed in the calculation results (10 decimals). The results show that calculation with double precision floating point values is indispensable for the application of the algorithms.

Fig. 9 Double precision calculation

A different kind of detection failure occurs for bits 27 to 30 if the bits are erroneous in the real and the imaginary part concurrently. These bits belong to the biased exponent of the floating point values. Concurrent bit errors in the biased exponent of the real and the imaginary parts cannot be detected by coding with complex numbers, but the resulting values show huge deviations from the correct ones. Thus plausibility checks may detect these errors.
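
Why concurrent exponent errors escape the angle check can be seen in a small Python sketch. It is a PC illustration under stated assumptions and not the test setup of the paper: ϕ is interpreted as 23 degrees, c = 10 and the plausibility limit of 1000 are hypothetical values, and the invariant check compares the sine of the angle deviation with ε = 0.001. Setting the same exponent bit in both parts scales the real and the imaginary part by the same power of two, so the angle is unchanged while the magnitude is grossly wrong.

```python
import cmath
import math
import struct

PHI = math.radians(23.0)   # assumed interpretation of phi = 23
S = cmath.exp(1j * PHI)    # s = e^{i*phi}
Q = 10.0 * S               # q = c * e^{i*phi} with a hypothetical c = 10
EPS = 1e-3                 # tolerance as used for single precision
LIMIT = 1000.0             # hypothetical plausibility limit for decoded values


def to_float32(x: float) -> float:
    """Round x to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def stuck_at_1(x: float, bit: int) -> float:
    """Force one bit of the single precision pattern of x to 1."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits | (1 << bit)))[0]


def encode(r: float) -> complex:
    """Encode r and store both parts in single precision."""
    z = r * S + Q
    return complex(to_float32(z.real), to_float32(z.imag))


def decode(z: complex) -> float:
    """r = sgn(arg(z - q)) * |z - q|."""
    d = z - Q
    return math.copysign(abs(d), cmath.phase(d))


def is_valid(z: complex) -> bool:
    """Angle check: after rotating by -phi the value must be almost real."""
    w = z * S.conjugate()
    return abs(w.imag) <= EPS * abs(w)


def is_plausible(r: float) -> bool:
    """Additional range check on the decoded value."""
    return abs(r) <= LIMIT


z = encode(2.5)
# Bit 27 (an exponent bit) stuck at 1 in the real part only: the angle changes.
real_only = complex(stuck_at_1(z.real, 27), z.imag)
# The same bit stuck at 1 in both parts: both are scaled by the same factor.
both = complex(stuck_at_1(z.real, 27), stuck_at_1(z.imag, 27))

print(is_valid(real_only))                   # invariant violated
print(is_valid(both))                        # invariant still satisfied
print(decode(both), is_plausible(decode(both)))
```

In this example the exponent bit is 0 in both parts of the correct value, so the stuck-at-1 error is effective in both. The range check then plays the role of the plausibility check mentioned above.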

Bit error simulations with double precision floating point values (64 bit) on a PC confirm that the problems with error detection in the fixed point part disappear. Concurrent bit errors in the biased exponent of the real and the imaginary parts still cannot be detected. The step from single precision to double precision will somewhat influence the calculation times, because floating point calculations are executed by the floating-point unit (FPU) of the processor.

8 Conclusions

The paper describes a methodology to detect erroneous floating point calculations in digital safety I&C during run time. The algorithms detect processor failures as well as memory failures. The theoretical background of the methodology was presented at the NPIC&HMIT conference in San Diego in 2012 [5]. This paper presents the theoretical estimation of the residual failure probability and the results of tests regarding:
- feasibility,
- timing,
- detection of bit errors,
- precision of calculation.

It was shown that the methodology is suitable for enhancing the detection of memory and processor failures, and also of addressing failures that result in wrong data. The algorithms produce higher processor loads and require the use of double precision data.

Further investigations may help to optimize and enhance the algorithms. Processor loads could possibly be reduced by the development of specialized hardware modules, similar to the FPU for floating point operations, e.g. based on Field Programmable Gate Arrays (FPGA). Additional applications, such as the separation of signal channels on a processor or the passivation of channels in case of calculation errors, may be taken into account.

Nomenclature

FB      Function block
FPGA    Field Programmable Gate Array
FPU     Floating-point unit
PC      Personal computer
⊕       Coded addition
⊖       Coded subtraction
⊗       Coded multiplication

Acknowledgement

This work was performed in the project "Development of error detecting calculation methods in high level languages for the use in safety I&C of nuclear power plants" under the auspices of the German Federal Ministry for Economic Affairs and Energy under the project number 1501424 within its scope of research in reactor safety.

References

[1] P. Forin, Vital Coded Microprocessor Principles and Applications for Various Transit Systems, IFAC Control, Computers, Communications in Transportation, Paris, France, pp. 79-84 (1989)
[2] N. Oh, S. Mitra, E. J. McCluskey, ED4I: Error Detection by Diverse Data and Duplicated Instructions, IEEE Transactions on Computers, Vol. 51, pp. 180-199 (2002)
[3] F. Schiller, Einführung in die kodierte Verarbeitung, TU München, itm (2007)
[4] Ch. Gerst, Kodierte Signalverarbeitung in Leitsystemen der Kraftwerkstechnik, Diplomarbeit, TU München, itm (2008)
[5] A. Lindner and Ch. Gerst, A Methodology for Detection of Failures in Floating Point Calculations in Safety I&C, Proc. of the 8th International Topical Meeting on Nuclear Plant Instrumentation, Control and Human Machine Interface Technologies (NPIC & HMIT 2012), July 22-26, 2012, San Diego, California
[6] TELEPERM XS User Manual, TXS CoreSW Release 3.6.4, PTLSDG/2012/en/0029 B
[7] Ch. Gerst, A. Lindner, J. Märtz, and A. Mölleken, Entwicklung von Methoden zum Einsatz fehlererkennender Berechnungsverfahren in Hochsprachen für Sicherheitsleittechnik in KKW, ISTec-A-2543, in preparation