CUDA accelerated fault tree analysis with C-XSC

Gabor Rebner (1), Michael Beer (2)
(1) Department of Computer and Cognitive Sciences (INKO), University of Duisburg-Essen, Duisburg, Germany
(2) Institute for Risk & Uncertainty, University of Liverpool, Liverpool, UK
19.09.2012

1.
2. Verification; CUDA; Fault Tree Analysis
3. C++ and CUDA; Evaluation
4. Future Work

of verified fault tree analysis in C++ using high-performance GPU (Graphics Processing Unit) computing
Issues: Using GPU-accelerated high-performance features to
1. Reduce the trade-off between computation accuracy and computation time
2. Use directed rounding based on the IEEE 754-2008 standard on the GPU

Verification
Definition: We use verification in its narrow sense, referring to a mathematical proof of the correctness of a result obtained by a computer calculation.
Tools:
- Interval arithmetic provided by C-XSC
- Floating point arithmetic with directed rounding
  - Central Processing Unit (CPU)
  - Compute Unified Device Architecture (CUDA)

A short introduction to CUDA
Compute Unified Device Architecture (CUDA)
- High-performance GPU architecture
- Single Instruction, Multiple Data (SIMD) implementation
- Up to $2^{10}$ CUDA cores on the NVIDIA GTX 590
- Restriction to NVIDIA graphics cards
- Support of IEEE 754 floating point operations
  - Double precision
  - Directed rounding to the next floating point number (such as $\mathrm{fl}_\nabla(x)$ and $\mathrm{fl}_\Delta(x)$ with $x \in \mathbb{R}$)
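Directed rounding can be tried out on the CPU before moving to the GPU. The following is a minimal C++ sketch, not the authors' code: it assumes the platform honours `std::fesetround` from `<cfenv>`, and the helper names `add_rounded`, `add_down` and `add_up` are hypothetical. In CUDA device code the corresponding per-operation intrinsics are, for example, `__dadd_rd` and `__dadd_ru`; whether the presented implementation uses them is not stated on the slides.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Hypothetical helper: evaluates a + b under the given rounding mode.
// The volatile temporaries keep the compiler from folding the addition
// at compile time or moving it out of the region where `mode` is active.
double add_rounded(double a, double b, int mode) {
    const int old_mode = std::fegetround();
    const int rc = std::fesetround(mode);
    assert(rc == 0);  // assumption: the requested mode is supported
    (void)rc;
    volatile double va = a, vb = b;
    volatile double r = va + vb;
    std::fesetround(old_mode);  // restore the caller's rounding mode
    return r;
}

// fl_nabla(a + b): result rounded towards minus infinity.
double add_down(double a, double b) { return add_rounded(a, b, FE_DOWNWARD); }

// fl_delta(a + b): result rounded towards plus infinity.
double add_up(double a, double b) { return add_rounded(a, b, FE_UPWARD); }
```

For a sum that is not exactly representable, such as 0.1 + 0.2, `add_down` and `add_up` return neighbouring doubles; for exactly representable sums both return the same value.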

Fault Tree Analysis
Fundamentals: The implementation is based on
- The approach by Traczinsky et al. (2006)
- Verified on modern computer systems
- CUDA


Complexity
Computation step: Each computation of a logical gate (AND- or OR-gate) has a complexity of $O(n^3)$:
- Computation of each interval element ($O(n \cdot n)$)
- Computation of the mass assignment for each interval
- Total complexity: $O(n^3)$
Improvements: The algorithm can be improved to obtain an upper bound of complexity slightly smaller than $O(n^3)$.
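The $O(n \cdot n)$ step can be pictured as a double loop over the elements of the two gate inputs. The sketch below is an illustration under assumptions, not the presented algorithm: it assumes that n counts the elements per input, uses an AND-gate, works in plain round-to-nearest arithmetic without verification, and does not reproduce the further work that leads to the $O(n^3)$ total. The type `Element` and the function `combine_and` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical element: an interval [lo, hi] with its mass assignment.
struct Element {
    double lo;
    double hi;
    double mass;
};

// Pairs every element of `a` with every element of `b` (n * n pairs) and
// stores the AND-gate product of the bounds and of the masses.
std::vector<Element> combine_and(const std::vector<Element>& a,
                                 const std::vector<Element>& b) {
    std::vector<Element> out;
    out.reserve(a.size() * b.size());
    for (const Element& x : a) {
        for (const Element& y : b) {
            assert(x.lo <= x.hi && y.lo <= y.hi);
            out.push_back({x.lo * y.lo, x.hi * y.hi, x.mass * y.mass});
        }
    }
    return out;
}
```

Two inputs with n elements each yield n * n result elements, which is where the quadratic term comes from.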

Verification under CUDA
Goal: Compute correct results on computer systems using finite floating point arithmetic
Approach:
- Directed rounding (GPU source code)
- Interval arithmetic (C-XSC in CPU source code)

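Interval Notation
Real Intervals ($\mathbb{IR}$): $\mathbf{x} = [\underline{x}, \overline{x}]$ with $\underline{x} \le x \le \overline{x}$; $x$, $\underline{x}$ and $\overline{x} \in \mathbb{R}$
Machine Intervals ($\mathbb{IF}$): $\mathbf{x} = [\underline{x}, \overline{x}]$ with $\underline{x} \le x \le \overline{x}$; $x$, $\underline{x}$ and $\overline{x} \in \mathbb{F} \setminus \{\text{Not a number}, \pm\infty\}$
Description:
- $\mathbf{x}$ is an interval from the set $\mathbb{IR}$ or $\mathbb{IF}$
- $\underline{x}$ is the infimum/minimum of $\mathbf{x}$
- $\overline{x}$ is the supremum/maximum of $\mathbf{x}$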
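The machine-interval definition translates directly into a small data type. The following C++ sketch is illustrative only and does not use C-XSC's own interval type; `Interval`, `is_valid`, `contains` and `widen_one_ulp` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical machine interval [inf, sup] with bounds taken from the
// floating point numbers excluding NaN and +-infinity.
struct Interval {
    double inf;  // infimum / minimum
    double sup;  // supremum / maximum
};

// True if both bounds are finite numbers and inf <= sup.
bool is_valid(const Interval& x) {
    return std::isfinite(x.inf) && std::isfinite(x.sup) && x.inf <= x.sup;
}

// True if `value` lies within the closed interval.
bool contains(const Interval& x, double value) {
    return x.inf <= value && value <= x.sup;
}

// Interval reaching one neighbouring double below and above `nearest`.
// If `nearest` is the round-to-nearest image of a real number, that real
// number lies inside the returned interval.
Interval widen_one_ulp(double nearest) {
    assert(std::isfinite(nearest));
    return {std::nextafter(nearest, -INFINITY),
            std::nextafter(nearest, INFINITY)};
}
```

For example, `widen_one_ulp(1.0 / 3.0)` gives a valid machine interval that contains the real number 1/3, although 1/3 itself is not a floating point number.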

Verification under CUDA
Problem: Let $x = \tfrac{1}{3}$ and $x \in \mathbb{R}$. In floating point arithmetic $x + x \neq \tfrac{2}{3}$, so the result is enclosed between a lower and an upper bound:
$$\tfrac{2}{3} \in [\,\underbrace{\mathrm{fl}_\nabla(x + x)}_{\text{lower bound}},\ \underbrace{\mathrm{fl}_\Delta(x + x)}_{\text{upper bound}}\,]$$
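The 1/3 example can be reproduced on the CPU. This is a sketch under assumptions: it relies on `std::fesetround` being honoured, the names `eval`, `Enclosure` and `two_thirds` are hypothetical, and, because 1/3 is itself not representable, the sketch also encloses x before adding, which goes one step beyond the formula on the slide.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper: computes a + b (op == '+') or a / b (otherwise)
// under the given rounding mode. The volatile temporaries keep the
// operation inside the region where `mode` is active.
double eval(char op, double a, double b, int mode) {
    const int old_mode = std::fegetround();
    std::fesetround(mode);
    volatile double va = a, vb = b;
    volatile double r = (op == '+') ? va + vb : va / vb;
    std::fesetround(old_mode);
    return r;
}

struct Enclosure {
    double lower;
    double upper;
};

// Encloses x + x for the real number x = 1/3.
Enclosure two_thirds() {
    const double x_lo = eval('/', 1.0, 3.0, FE_DOWNWARD);  // <= 1/3
    const double x_hi = eval('/', 1.0, 3.0, FE_UPWARD);    // >= 1/3
    assert(x_lo <= x_hi);
    return {eval('+', x_lo, x_lo, FE_DOWNWARD),
            eval('+', x_hi, x_hi, FE_UPWARD)};
}
```

Neither bound equals 2/3, since 2/3 is not representable; the point is that the exact value is guaranteed to lie between them.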

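Verification under CUDA
Let $\mathbf{x}$ and $\mathbf{y}$ be two scale elements (intervals) and $m_x$ and $m_y$ the corresponding mass assignments.
Lower Failure Bound (OR-Gate)
$$lb = \mathrm{fl}_\nabla\big(\mathrm{fl}_\nabla(\underline{x} + \underline{y}) - \mathrm{fl}_\Delta(\underline{x} \cdot \underline{y})\big) \quad \text{with } \underline{x}, \underline{y} \in [0, 1],$$
$$m_{lb} = \mathrm{fl}_\nabla(m_x \cdot m_y) \quad \text{with } m_x, m_y \in [0, 1].$$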
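A CPU-side C++ sketch of the OR-gate lower bound follows; it is an illustration, not the GPU kernel from the talk. The rounding directions are chosen so that the result can only err downwards: the sum is rounded down, the product that gets subtracted is rounded up, and the difference is rounded down. Rounding the mass product downwards is an assumption. The names `eval`, `LowerBound` and `or_gate_lower` are hypothetical.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper: applies '+', '-' or '*' to a and b under `mode`.
double eval(char op, double a, double b, int mode) {
    const int old_mode = std::fegetround();
    std::fesetround(mode);
    volatile double va = a, vb = b;
    volatile double r = 0.0;
    switch (op) {
        case '+': r = va + vb; break;
        case '-': r = va - vb; break;
        default:  r = va * vb; break;
    }
    std::fesetround(old_mode);
    return r;
}

struct LowerBound {
    double lb;    // lower failure bound of the OR-gate
    double m_lb;  // mass assignment of the combined element
};

// x_lo and y_lo are the infima of the two intervals, m_x and m_y their masses.
LowerBound or_gate_lower(double x_lo, double y_lo, double m_x, double m_y) {
    assert(0.0 <= x_lo && x_lo <= 1.0 && 0.0 <= y_lo && y_lo <= 1.0);
    assert(0.0 <= m_x && m_x <= 1.0 && 0.0 <= m_y && m_y <= 1.0);
    const double sum = eval('+', x_lo, y_lo, FE_DOWNWARD);
    const double prod = eval('*', x_lo, y_lo, FE_UPWARD);
    return {eval('-', sum, prod, FE_DOWNWARD),
            eval('*', m_x, m_y, FE_DOWNWARD)};  // assumption: rounded down
}
```

With exactly representable inputs such as 0.5 and 0.5 no rounding occurs and the bound is the exact value 0.75; for inputs such as 0.1 and 0.2 the bound is at most the round-to-nearest result.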

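Verification under CUDA
Let $\mathbf{x}$ and $\mathbf{y}$ be two scale elements (intervals) and $m_x$ and $m_y$ the corresponding mass assignments.
Lower Failure Bound (AND-Gate)
$$lb = \mathrm{fl}_\nabla(\underline{x} \cdot \underline{y}) \quad \text{with } \underline{x}, \underline{y} \in [0, 1],$$
$$ub = \mathrm{fl}_\Delta(\overline{x} \cdot \overline{y}) \quad \text{with } \overline{x}, \overline{y} \in [0, 1],$$
$$m = \mathrm{fl}(m_x \cdot m_y) \quad \text{with } m_x, m_y \in [0, 1].$$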
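The AND-gate needs only products, so a CPU-side sketch is short. As before, this is an illustration under assumptions rather than the presented GPU code: `mul_rounded`, `Interval`, `GateResult` and `and_gate` are hypothetical names, and the mass product is computed in round-to-nearest because no rounding direction is given for it.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper: a * b evaluated under the given rounding mode.
double mul_rounded(double a, double b, int mode) {
    const int old_mode = std::fegetround();
    std::fesetround(mode);
    volatile double va = a, vb = b;
    volatile double r = va * vb;
    std::fesetround(old_mode);
    return r;
}

struct Interval {
    double inf;
    double sup;
};

struct GateResult {
    double lb;  // lower failure bound
    double ub;  // upper failure bound
    double m;   // mass assignment of the combined element
};

GateResult and_gate(const Interval& x, const Interval& y,
                    double m_x, double m_y) {
    assert(0.0 <= x.inf && x.inf <= x.sup && x.sup <= 1.0);
    assert(0.0 <= y.inf && y.inf <= y.sup && y.sup <= 1.0);
    assert(0.0 <= m_x && m_x <= 1.0 && 0.0 <= m_y && m_y <= 1.0);
    return {mul_rounded(x.inf, y.inf, FE_DOWNWARD),   // infima, rounded down
            mul_rounded(x.sup, y.sup, FE_UPWARD),     // suprema, rounded up
            mul_rounded(m_x, m_y, FE_TONEAREST)};     // assumption: nearest
}
```

Because all bounds lie in [0, 1], multiplying the infima gives the smallest product and multiplying the suprema the largest, so no case distinction on signs is needed.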

Computation time
Wall-clock time [s] spent on computation
Configurations:
- Benchmark 1 (B1): n = 200, f = 20, l = 100
- Benchmark 2 (B2): n = 5000, f = 100, l = 60

      C++ (LB) [a]   C++ (UB)   DSI [b] (LB)   DSI (UB)
B1    7              7          1685           1712
B2    721            654        48070          46160

[a] C++ utilizing C-XSC and CUDA
[b] DSI 3.5.2 and INTLAB V6
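Read as ratios, the listed times give the speed-up of the C++/CUDA implementation over DSI. The factors below are derived here from the table values and rounded to one decimal; they are not stated on the slide.

```latex
\begin{align*}
\text{B1 (LB):} \quad & 1685 / 7 \approx 240.7 \\
\text{B1 (UB):} \quad & 1712 / 7 \approx 244.6 \\
\text{B2 (LB):} \quad & 48070 / 721 \approx 66.7 \\
\text{B2 (UB):} \quad & 46160 / 654 \approx 70.6
\end{align*}
```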

Computation time
Figure: Wall-clock time [s] spent on computation (logarithmic), comparing C++ & CUDA with MATLAB & INTLAB for benchmark 1 (LB, UB) and benchmark 2 (LB, UB)

Achievements
- Reduction of the trade-off between accuracy and computation time
- Verified computation on the GPU using CUDA

Future Work
Perspective: Using high performance computing
- In MATLAB, utilizing the MEX interface with CUDA and C-XSC
- To compute Markov set chains (imprecise Markov chains)

[1] Auer, E.; Luther, W.; Rebner, G.; Limbourg, P.: A Verified MATLAB Toolbox for the Dempster-Shafer Theory. In: Proceedings of the Workshop on the Theory of Belief Functions, 2010. www.udue.de/DSIPaperone, http://www.udue.de/DSI
[2] Carreras, C.; Walker, I.: Interval Methods for Fault-Tree Analyses in Robotics. In: IEEE Transactions on Reliability 50 (2001), pp. 3-11. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00935010
[3] IEEE Computer Society: IEEE Standard for Floating-Point Arithmetic. In: IEEE Std 754-2008 (2008), 29, pp. 1-58. http://dx.doi.org/10.1109/ieeestd.2008.4610935. DOI 10.1109/IEEESTD.2008.4610935
[4] Krämer, H.: C-XSC 2.0: A C++ Library for Extended Scientific Computing. In: Lecture Notes in Computer Science, vol. 2991/2004. Springer-Verlag, Heidelberg, 2004, pp. 15-35
[5] Krämer, W.; Zimmer, M.; Hofschuster, W.: Using C-XSC for High Performance Verified Computing. Version: 2012. In: Jónasson, Kristján (ed.): Applied Parallel and Scientific Computing, vol. 7134. Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-28144-0, pp. 168-178. http://dx.doi.org/10.1007/978-3-642-28145-7_17. DOI 10.1007/978-3-642-28145-7_17
[6] NVIDIA: Plattform für Parallel-Programmierung und parallele Berechnungen. Website http://www.nvidia.de/object/cuda_home_new_de.html
[7] Rebner, G.; Auer, E.; Luther, W.: A verified realization of a Dempster-Shafer based fault tree analysis. In: Computing 94 (2012), pp. 313-324. http://dx.doi.org/10.1007/s00607-011-0179-3. DOI 10.1007/s00607-011-0179-3. ISSN 0010-485X

Thank you