Towards Transactional Memory for Safety-Critical Embedded Systems
Stefan Metzlaff, Sebastian Weis, and Theo Ungerer
Department of Computer Science, University of Augsburg, Germany
Euro-TM Workshop on Transactional Memory, April 14, 2013
WTM13 Metzlaff, Weis, and Ungerer / TM for Safety-Critical Embedded Systems

Motivation
- Safety-critical embedded systems
  - Avionics or automotive domain
  - Real-time constraints
  - Fault tolerance constraints
  - Different certification requirements (SIL 1-4, DAL A-E)
- Trend towards
  - High performance
  - Low power
  - E.g. autonomous driving
- Multi-core processors and parallel applications
[Images: A380 [1]; Google Driverless Car [2]]

Motivation: Transactional memory in safety-critical systems
- Concurrency control
  - Predictable execution in multi-cores
  - Real-time capable concurrency
  - Bounding communication interferences
- Fault tolerance
  - Fault containment
  - Fault detection
  - Fault recovery

Real-Time, Multi-Core & TM
- Interferences at shared resources
  - Access to bus, memory, and I/O
  - Predictable arbitration with bandwidth guarantees (e.g. TDMA)
- Concurrency control
  - Interferences at application level
- Requirements for hard real-time (HRT) TM
  - Commit guarantee for each transaction
  - Calculable number of transaction aborts
  - HRT contention management
- Related work: [Fahmy et al. 2009] and [Schoeberl et al. 2010]
[Figure labels: Core 1, Core 2, Cache, Memory, Bus, I/O Device]

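The bandwidth guarantee of TDMA arbitration can be illustrated with a small calculation. The sketch below is not taken from the talk: it assumes a hypothetical bus with one equal-length slot per core in a fixed round-robin schedule, and it assumes an access may only start at the beginning of the owning core's slot. The function names and parameters are illustrative.

```python
def wait_until_slot(core: int, arrival: int, num_cores: int, slot_len: int) -> int:
    """Cycles a request arriving at cycle `arrival` waits for `core`'s next slot start.

    Assumption: core k owns the slot starting at cycle k * slot_len in every period.
    """
    period = num_cores * slot_len
    slot_start = core * slot_len
    return (slot_start - arrival) % period


def worst_case_wait(core: int, num_cores: int, slot_len: int) -> int:
    """Largest wait over all possible arrival cycles within one TDMA period."""
    period = num_cores * slot_len
    return max(wait_until_slot(core, t, num_cores, slot_len) for t in range(period))
```

Under these assumptions the bound depends only on the schedule (number of cores and slot length), not on what the other cores do, which is what makes the arbitration analysable.
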
HRT-TM Design Overview
- Lazy versioning
  - No cascading roll-backs
- Predictable transactions by commit ordering
  - FIFO transaction commit queue
  - Registering transactions on transaction begin
  - Commit serialisation
  - Bounded number of aborts and transaction delay
- Allows estimation of WCET bounds (requires the set of concurrent transactions)
- Predictable concurrency control in shared memory systems
[Figure: transactions 1 to 4 in the commit queue; states: Running, Waiting, Committing, Aborting]

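The commit-ordering idea can be sketched as a toy model. This is not the HRT-TM implementation: the conflict rule (a queued transaction aborts once when its read set overlaps the write set of the committing head) and all names are assumptions made for illustration. Under this rule a transaction registered at queue position k can be aborted at most k times, which is one way a bounded abort count can arise.

```python
from collections import deque


class Tx:
    """Hypothetical transaction record: name, read set, write set, abort counter."""

    def __init__(self, name, reads, writes):
        self.name = name
        self.reads = set(reads)
        self.writes = set(writes)
        self.aborts = 0


def run_commit_queue(txs):
    """Commit transactions strictly in registration (FIFO) order.

    When the head commits, every still-queued transaction whose read set
    overlaps the committed write set is counted as aborted once (it would
    have to re-execute before its own commit).
    """
    queue = deque(txs)  # registration order equals commit order
    commit_order = []
    while queue:
        head = queue.popleft()
        commit_order.append(head.name)
        for tx in queue:
            if tx.reads & head.writes:
                tx.aborts += 1
    return commit_order
```

For example, with three transactions registered in the order T1, T2, T3, the abort count of T3 can never exceed 2 in this model, regardless of the access pattern.
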
HRT-TM Enhancement for Non Real-Time
- Applications with tasks of different RT requirements
  - E.g.: Advanced Driver Assistance System
  - Hard real-time (HRT): collision avoidance
  - Soft real-time (SRT): night vision
  - Best-effort (BE): traffic sign recognition
- Data sharing among applications
- Limiting interference of non-HRT tasks
  - Prioritised TM contention manager
  - Interferences only during commit of BE task
  - Analysis requires profiling BE working sets
- Preliminary results: minimal impact of BE tasks on WCET bounds of HRT tasks
[Images: Collision Avoidance, HRT [3]; Night Vision, SRT [4]; Traffic Sign Recognition, BE [5]]

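A prioritised contention manager could resolve conflicts along the following lines. This is a minimal sketch under stated assumptions, not the authors' design: the rule that a transaction already in its commit phase is never aborted is one reading of "interferences only during commit of BE task", and the tie-break by registration order as well as all names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Criticality(IntEnum):
    BE = 0   # best-effort
    SRT = 1  # soft real-time
    HRT = 2  # hard real-time


@dataclass
class Tx:
    name: str
    crit: Criticality
    seq: int                  # registration order, lower means earlier
    committing: bool = False  # True once the commit phase has started


def loser(a: Tx, b: Tx) -> Tx:
    """Return the transaction that must abort (or wait) when a and b conflict."""
    # Assumption: a transaction that is already committing is never aborted,
    # so the other one has to wait for that commit to finish.
    if a.committing != b.committing:
        return b if a.committing else a
    # Otherwise the lower criticality loses.
    if a.crit != b.crit:
        return a if a.crit < b.crit else b
    # Same criticality: the later registration loses (FIFO order).
    return a if a.seq > b.seq else b
```

In this model an HRT transaction is delayed by a BE transaction only for the duration of an ongoing BE commit, so the interference term depends on the size of the BE write set.
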
Fault Tolerance & TM
- Encapsulation of vulnerable code in transactions
- Redundant execution of transactions
- Fault model
  - Core: transient and permanent faults
  - Interconnect: transient faults only
  - LLC & Memory: protected by ECC (not covered in this work)
- Related work: [Yalcin et al. 2010] and [Sanchez et al. 2010]
[Figure labels: Core, Local Memory, Bus / Interconnect, LLC, Memory; Permanent Faults, Transient Faults]

FT-TM Fault Detection and Recovery
- Fault containment: lazy versioning TM
- Fault detection: redundant execution of TXs
  - Spatial, temporal, or both
  - TXs cannot change global state
  - Comparison of write sets of TXs and register sets
- Fault recovery: check-pointing system state
  - State of memory already managed by TM
  - Register set needs to be saved on TX begin
  - Rollback to TX begin on fault via TX retry
[Figure labels: Fault Containment, Fault-Detection, Fault Recovery, Contention Manager]

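The interplay of containment, detection, and recovery can be sketched in software. The model below is an illustration under assumptions, not the FT-TM hardware design: a transaction body is a hypothetical function that receives private copies of memory and registers and returns its buffered write set and resulting register set; it is executed twice (temporal redundancy), and a mismatch triggers a retry from the register checkpoint.

```python
def execute_redundantly(tx_body, memory, registers, max_retries=3):
    """Run tx_body twice per attempt and commit only if both runs agree.

    tx_body(mem_view, regs) -> (write_set, new_regs), both plain dicts.
    Returns (committed register set, number of retries that were needed).
    """
    checkpoint = dict(registers)  # register set saved at TX begin
    for attempt in range(1 + max_retries):
        # Each run works on private copies, so a faulty run cannot
        # change global state (containment via lazy versioning).
        ws_a, regs_a = tx_body(dict(memory), dict(checkpoint))
        ws_b, regs_b = tx_body(dict(memory), dict(checkpoint))
        if ws_a == ws_b and regs_a == regs_b:
            memory.update(ws_a)  # commit the agreed write set
            return regs_a, attempt
        # Mismatch: drop both results; the next loop iteration restarts
        # from the checkpoint (rollback to TX begin via retry).
    raise RuntimeError("redundant executions kept disagreeing")
```

Because memory is only touched on a successful comparison, recovery needs no extra memory checkpoint; only the register set is saved, as stated on the slide.
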
FT-TM Levels of Fault Tolerance
- Tasks with different FT properties
  - Low or high error rate
  - HRT or BE requirements
- Fault detection and recovery schemes
  - (1) 1 core, > 2x execution time overhead on fault (transient only)
  - (2) 2 cores, > 1x execution time overhead on fault
  - (3) 3 cores, < 1x execution time overhead on fault
- Towards an individual level of fault tolerance for each task
[Figure: schemes (1), (2), and (3); labels: Fault-Detection, Roll-back Recovery, Send Commit, Forward Error Correction]

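The difference between detection with roll-back and forward error correction can be shown with a majority vote over replica results. The slide does not specify how the three-core scheme corrects errors; the sketch below assumes a simple strict-majority vote, and the function name and result encoding are hypothetical.

```python
from collections import Counter


def majority_vote(results):
    """Return the result a strict majority of replicas agree on, else None.

    Results must be hashable, e.g. a tuple of sorted write-set items.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None
```

With two replicas a single faulty result leaves no majority, so the fault is only detected and a roll-back is needed. With three replicas the two matching results outvote the faulty one, and the transaction can commit without re-execution.
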
Conclusions and Future Work
- Transactional memory for safety-critical embedded systems
  - Hard real-time: isolation and predictability
  - Fault tolerance: fault containment, detection, and recovery
  - Mixed criticality systems: different requirements for tasks
- Future work
  - Enhance HRT-TM by soft real-time support
  - Fault recovery schemes for FT-TM
  - Integration of real-time and fault tolerance

Questions?
References:
[1] http://www.flickr.com/photos/8313254@n08/496320750/
[2] http://www.flickr.com/photos/jurvetson/5499949739/
[3] http://www.flickr.com/photos/13524418@n07/2921138655/
[4] http://www.flickr.com/photos/jurvetson/22226826/
[5] from Eichner, M.L.; Breckon, T.P., Integrated speed limit detection and recognition from real-time video, Intelligent Vehicles Symposium, pp.626-631, 2008, IEEE

References:
[Fahmy et al. 2009]: S. F. Fahmy, B. Ravindran, and E. D. Jensen. On bounding response times under software transactional memory in distributed multiprocessor real-time systems. DATE, 2009.
[Schoeberl et al. 2010]: M. Schoeberl, F. Brandner, and J. Vitek. RTTM: real-time transactional memory. SAC, 2010.
[Sanchez et al. 2010]: D. Sanchez, J. L. Aragon, and J. M. Garcia. A log-based redundant architecture for reliable parallel computation. HiPC, 2010.
[Yalcin et al. 2010]: G. Yalcin, O. Unsal, I. Hur, A. Cristal, and M. Valero. FaulTM: Fault-Tolerance Using Hardware Transactional Memory. PESPMA, 2010.