Diagnosis in the Time-Triggered Architecture

Diagnosis in the Time-Triggered Architecture 1 H. Kopetz, TU Wien, June 2010

Embedded Systems 2 An Embedded System is a Cyber-Physical System (CPS) that consists of two subsystems: a physical subsystem (the P-system) that is controlled by the laws of physics and is based on a dense model of time, and a (distributed) computer system (the C-system) that is controlled by computer programs and is based on a discrete model of time. There are thus two different models of time: dense time in the physical system and discrete (or sparse) time in the cyber system.

Embedded System: Physical World meets Cyber World 3 [Figure: nodes of the P-system and interface nodes of the C-system connected by a real-time communication system; an event in the P-system is observed at an interface node of the C-system.]

System Boundaries are not Clear-Cut 4 In a large plant, the boundaries between the physical system and the cyber system cannot be established easily. If we partition a plant into components, many components will belong to both worlds, e.g., a smart sensor or a smart actuator. We have to take a system view, where the behavior of components is the concern, irrespective of whether they belong to the physical world or the cyber world.

The Grand Vision 5 Reduce unscheduled maintenance in industrial plants to zero. Produce cars that never fail on the road. Deliver embedded systems with close to 100% availability, 24 hours a day, 7 days a week.

Reality is not Technology Paradise 6 Every piece of hardware in the plant and in the computer system will eventually fail. Failures will occur sporadically (unforeseen) or can, to some extent, be anticipated (foreseen). The design (software) is not free of design errors. A successful system will evolve, which leads to changes in the specification.

Six Steps to Take 7 (1) Identify and design the planned field-replaceable units (FRUs) with appropriate properties. (2) Characterize the FRUs according to their expected failure rate. (3) Observe the behavior of the FRUs and collect data about anomalous behavior. (4) Analyze the data about anomalous behavior, either online or offline, to locate faulty FRUs. (5) Replace the faulty FRU. (6) Restart the replaced FRU.

Outline of the Talk 8 We will analyze each of these six steps and show which mechanisms of the Time-Triggered Architecture support each step.

Complexity Management 9 In the Time-Triggered Architecture (TTA), the design of the diagnostic subsystem is driven by the principle of divide and conquer, meaning that the diagnostic subsystem is, as far as possible, completely independent of the rest of the system. There should be no unintended interdependence between these two subsystems, at either the hardware or the software level.

Six Steps to Take 10 (1) Identify and design the planned field-replaceable units (FRUs) with appropriate properties. (2) Characterize the FRUs according to their expected failure rate. (3) Observe the behavior of the FRUs and collect data about anomalous behavior. (4) Analyze the data about anomalous behavior, either online or offline, to locate faulty FRUs. (5) Replace the faulty FRU. (6) Restart the replaced FRU.

(1) FRU Design 11 The size and accessibility of FRUs are determined by the maintenance strategy. Both the physical system and the control system of an embedded system must be partitioned into FRUs. The size of an FRU has an influence on its reliability. FRUs that contain software, where design faults can be corrected by loading a new version of the software, need special attention.

Size versus Reliability of an FRU 12 An FRU must have maintainable interfaces such that it can be replaced easily within a short time. A maintainable interface is less reliable than a solid interface, e.g., a plug versus a soldered connection. A non-maintainable subsystem, where all connections are solid, is the most reliable subsystem. The size of an FRU is therefore determined by a tradeoff between spare-part cost, reliability, and maintenance effort.
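
A brief, hedged illustration of this tradeoff (the numbers and the assumption of independent connection failures are illustrative only): if an FRU is attached via 10 connections and each soldered connection survives the mission with probability 0.9999, the joint reliability of the connections is 0.9999^10 ≈ 0.999; replacing them with pluggable connections of reliability 0.999 each gives 0.999^10 ≈ 0.990, i.e., the probability of a connection failure increases roughly tenfold as the price of easy replaceability.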

Contribution of the TTA 13 The TTA is built around a crystal-clear component model: a component is a hardware/software unit with precisely specified behavior across its message interfaces. A TTA component is a Fault-Containment Unit (FCU). Temporal error propagation from a faulty FCU to FCUs that are not affected by the fault is not possible, due to the error-containment boundaries established by the time-triggered communication system. An FRU consists of one or more TTA components.

Message Interfaces of a Component 14 A component offers the following message interfaces: the Technology-Independent Control Interface (TII) for configuration and execution control; the Linking Interface (LIF), which offers the services of the component and is the interface relevant for composability; the Technology-Dependent Control Interface (TDI); and local interfaces to the environment.
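
A minimal sketch of how these interfaces might be reflected in code; all type and field names below are illustrative assumptions, not part of any TTA specification:

    #include <stddef.h>

    /* TII: technology-independent configuration and execution control. */
    typedef struct {
        void (*configure)(const void *config, size_t len);
        void (*start)(void);
        void (*stop)(void);
    } tii_interface_t;

    /* LIF: the time-triggered messages that offer the component's services;
     * this is the only interface relevant for composability. */
    typedef struct {
        int (*send_tt_message)(unsigned slot, const void *msg, size_t len);
        int (*recv_tt_message)(unsigned slot, void *msg, size_t len);
    } lif_interface_t;

    typedef struct {
        tii_interface_t tii;    /* technology-independent control           */
        lif_interface_t lif;    /* linking interface (component services)   */
        void           *tdi;    /* technology-dependent control interface   */
        void           *local;  /* local interfaces to the environment      */
    } tta_component_interfaces_t;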

On an MPSoC, a TTA Component is an IP Core 15 [Figure: chip-level block diagram in which each TTA component is implemented as an IP core. FP7 GENESYS, 9/3/10]

Six Steps to Take 16 (1) Identify and design the planned field-replaceable units (FRUs) with appropriate properties. (2) Characterize the FRUs according to their expected failure rate. (3) Observe the behavior of the FRUs and collect data about anomalous behavior. (4) Analyze the data about anomalous behavior, either online or offline, to locate faulty FRUs. (5) Replace the faulty FRU. (6) Restart the replaced FRU.

(2) Design Faults versus Physical Faults 17 Only the impact of a fault on the behavior of a component can be observed. From the behavioral point of view it is not possible to decide whether a single failure was caused by a transient hardware fault or by a Heisenbug in the software. In order to make this distinction, which is very relevant from the point of view of maintenance, a set of transient failures in a population of devices must be analyzed.

(2) Transient versus Permanent 18 Transient failure: the behavior is incorrect, but the hardware is not damaged; a fast restart of the component with a relevant restart state will eliminate the error. Permanent failure: the hardware is permanently broken and must be replaced. At first, a failure of an electronic component is assumed to be transient. If the restart is not successful, the assumption must be revised and a permanent hardware failure must be assumed.
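
A minimal sketch of this assume-transient-first policy; component_restart() and component_is_operational() are hypothetical maintenance-layer helpers, not TTA APIs:

    #include <stdbool.h>

    extern void component_restart(int component_id);          /* hypothetical */
    extern bool component_is_operational(int component_id);   /* hypothetical */

    typedef enum { FAILURE_TRANSIENT, FAILURE_PERMANENT } failure_class_t;

    /* First assume the failure is transient and try a fast restart with a
     * relevant restart state; only if restarting does not succeed is the
     * failure reclassified as permanent and the FRU scheduled for replacement. */
    failure_class_t classify_failure(int component_id, int max_restarts)
    {
        for (int attempt = 0; attempt < max_restarts; attempt++) {
            component_restart(component_id);
            if (component_is_operational(component_id))
                return FAILURE_TRANSIENT;    /* restart eliminated the error */
        }
        return FAILURE_PERMANENT;            /* hardware must be replaced    */
    }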

Foreseen versus Unforeseen Failures 19 A distinction is made between foreseen and unforeseen failures. Foreseen failures (wear-out): FRUs with an increasing failure rate, where a failure can be foreseen; provide sensors to measure the parameters that are linked to an increase of the failure rate, e.g., the temperature of a bearing. Unforeseen failures: it is difficult to predict the failure of an electronic component.

Transform Unforeseen Failures to Foreseen Failures 20 What architectural services are needed to implement Triple Modular Redundancy (TMR) at the architecture level? An independent fault-containment region for each of the replicated components; a synchronization infrastructure for the components; predictable multicast communication; replicated communication channels; support for voting; deterministic (which includes timely) operation; and identical state in the distributed components.

Triple Modular Redundancy (TMR) 21 Triple Modular Redundancy (TMR) is the generally accepted technique for the mitigation of component failures at the system level. [Figure: two triplicated processing stages A (A/1-A/3) and B (B/1-B/3), with a voter in front of each replica.]
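
A minimal sketch of the 2-out-of-3 vote performed by each voter in the figure above; the bit-exact agreement rule and the 32-bit value type are assumptions for illustration:

    #include <stdint.h>
    #include <stdbool.h>

    /* Delivers the majority value if at least two of the three replica values
     * agree bit-exactly; returns false if all three disagree (no majority). */
    bool tmr_vote(uint32_t a, uint32_t b, uint32_t c, uint32_t *voted)
    {
        if (a == b || a == c) { *voted = a; return true; }
        if (b == c)           { *voted = b; return true; }
        return false;   /* no majority: raise an error, deliver nothing */
    }

Bit-exact voting of this kind presupposes exactly the services listed on the previous slide, in particular deterministic operation and identical state in the replicated components.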

Cost-Effectiveness of TMR 22 [Figure: cost comparison, past versus future, of the cost of unscheduled maintenance (on-call labor, production loss, etc.) against the hardware cost of TMR.]

Contribution of the TTA 23 The TTA provides the mechanisms needed for the implementation of fault tolerance by active redundancy, such as TMR: the components are independent FCUs; communication is predictable and deterministic; a synchronized global time is available; the core communication service is multicast; and clock synchronization is fault-tolerant. After a permanent fault of an FCU, the FCU can be replaced at the next scheduled routine maintenance interval.

Six Steps to Take 24 (1) Identify and design the planned field-replaceable units (FRUs) with appropriate properties. (2) Characterize the FRUs according to their expected failure rate. (3) Observe the behavior of the FRUs and collect data about anomalous behavior. (4) Analyze the data about anomalous behavior, either online or offline, to locate faulty FRUs. (5) Replace the faulty FRU. (6) Restart the replaced FRU.

(3) Observe all Anomalies and Failures 25 An anomaly is a deviation from the normal behavior. It lies in the grey zone between correct behavior and failure; it is sometimes difficult to distinguish between an anomaly and an outright failure. An increasing number of anomalies indicates an approaching failure (example: an intermittent fault). Independent observation of the behavior must be supported at the level of the architecture.

The World of States is not Black and White 26 [Figure: the state space comprises intended (correct) states, out-of-norm states, and error states.]

Contribution of the TTA 27 The (external) behavior of a TTA component consists of the messages the component exchanges with its environment. The availability of the global time makes it possible to timestamp every anomaly precisely. The basic core communication service is multicast, so any message can be observed by an independent observer without a probe effect. At the chip level, a dedicated diagnostic component can look at all relevant messages. Every component is designed to output its ground state periodically so that it can be evaluated.
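
A minimal sketch of the time-stamped anomaly record that such an independent observer might keep; the field names, the log size, and the global_time_now() helper are assumptions, while the global time base and the observable multicast messages are taken from the slide above:

    #include <stdint.h>

    extern uint64_t global_time_now(void);   /* assumed access to the global time */

    typedef struct {
        uint64_t global_timestamp;   /* global time at which the anomaly was seen */
        uint16_t component_id;       /* sender of the observed message            */
        uint16_t slot_id;            /* time-triggered slot of the observation    */
        uint8_t  kind;               /* e.g. value out of norm, message missing   */
    } anomaly_record_t;

    #define LOG_SIZE 1024
    static anomaly_record_t anomaly_log[LOG_SIZE];
    static uint32_t anomaly_count;

    void record_anomaly(uint16_t component_id, uint16_t slot_id, uint8_t kind)
    {
        if (anomaly_count < LOG_SIZE) {
            anomaly_log[anomaly_count].global_timestamp = global_time_now();
            anomaly_log[anomaly_count].component_id     = component_id;
            anomaly_log[anomaly_count].slot_id          = slot_id;
            anomaly_log[anomaly_count].kind             = kind;
            anomaly_count++;
        }
    }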

Chip-Level Prototype of the TTA 28 [Figure: chip-level prototype including a dedicated diagnostic component. FP7 GENESYS, 9/3/10]

Six Steps to Take 29 (1) Identify and design the planned field-replaceable units (FRUs) with appropriate properties. (2) Characterize the FRUs according to their expected failure rate. (3) Observe the behavior of the FRUs and collect data about anomalous behavior. (4) Analyze the data about anomalous behavior, either online or offline, to locate faulty FRUs. (5) Replace the faulty FRU. (6) Restart the replaced FRU.

(4) Analyze the Anomalies 30 The collected information about the anomalies must be analyzed, either on-line or off-line, in a diagnostic subsystem. The establishment of potential causality links between events should be supported. Failures in the diagnostic subsystem must not propagate into the operational subsystem. Reproducibility of failures is needed.

Example (Causality): August 14, 2003 Power Outage 31 On August 14, 2003, a power outage caused a blackout in the north-eastern United States and Canada. This power outage has been examined and documented in an extensive report.

On the US-Canada Power Outage, August 14, 2003 32 "A valuable lesson from the August 14 blackout is the importance of having time-synchronized system data recorders. The Task Force's investigators labored over thousands of data items to determine the sequence of events, much like putting together small pieces of a very large puzzle. That process would have been significantly faster and easier if there had been wider use of synchronized data recording devices." From the Final Report on the August 14, 2003 Blackout by the US-Canada Power Outage Task Force, p. 162 (bold added).

Models of Time in the Cyber World 33 [Figure: three real-time timelines: dense time in physics (events A and B), discrete time in a central computer, and sparse time in a distributed computer.]

Models of Time in the Cyber World 34 [Figure: the same three timelines as on the previous slide, with the precision of the global time indicated on the sparse time base of the distributed computer.]

Fault-Tolerant Sparse Time Base 35 If the occurrence of events is restricted to active intervals of duration π, with an interval of silence of duration Δ between any two active intervals, then we call the time base π/Δ-sparse, or sparse for short. [Figure: timeline 0-9 with alternating active intervals π and silence intervals Δ.] Events are only allowed to occur in the active subintervals of the timeline. A static order is established among all involved partners to resolve simultaneity.
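
A minimal sketch of the sparse-time rule just defined; it assumes that the active/silence pattern starts at global time 0 and that π and Δ are expressed in ticks of the global time:

    #include <stdint.h>
    #include <stdbool.h>

    /* True if global timestamp t falls inside an active interval of a
     * pi/Delta-sparse time base, i.e. an instant at which events may occur. */
    bool in_active_interval(uint64_t t, uint64_t pi, uint64_t delta)
    {
        uint64_t period = pi + delta;    /* one active plus one silence interval */
        return (t % period) < pi;
    }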

The Intervals π and Δ in a Sparse Time Base 36 π and Δ depend on the precision P of the clock synchronization. In reality the precision is always larger than zero; in a distributed system, clocks can never be perfectly synchronized. The precision depends on the stability of the oscillators, on the length of the resynchronization interval, and on the accuracy of the interval measurement. On a discrete time base there is always the possibility that the same external event is observed with timestamps that differ by one tick.
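
As a hedged numerical illustration (all values assumed for illustration only): with a precision of P = 1 µs and a global-time granularity of 1 µs, two nodes may timestamp the same external event with values that differ by one tick; choosing the active and silence intervals comfortably larger than P, e.g., π = Δ = 5 µs, is what allows all correct nodes to assign a sparse event consistently to the same active interval.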

Contribution of the TTA 37 The establishment of potential causality is supported by the sparse time base of the TTA; determinism helps as well. The unidirectional core communication service separates the diagnostic subsystem from the operational subsystem. The analysis of the anomalies is performed in an independent diagnostic component (an FCU) of the TTA, with no low-level feedback to the operation. Diagnostic components can be provided at different levels of the architecture (e.g., chip level, device level). It is left to the system designer to decide whether the high-level feedback (reconfiguration) of the results of the diagnosis is automatic or manual.
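
A minimal sketch of a temporal-order test that a diagnostic component might apply during causality analysis; the rule used here, claiming a potential causal link only if the timestamps differ by more than the precision of the global time, is a simplified reading of the sparse-time argument above, not a verbatim TTA rule:

    #include <stdint.h>
    #include <stdbool.h>

    /* The event with timestamp t_cause can only be a potential cause of the
     * event with timestamp t_effect if it demonstrably happened earlier, i.e.
     * the timestamps differ by more than the precision of the global time. */
    bool potential_cause(uint64_t t_cause, uint64_t t_effect, uint64_t precision)
    {
        return (t_effect > t_cause) && (t_effect - t_cause > precision);
    }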

Six Steps to Take 38 (1) Identify and design the planned field-replaceable units (FRUs) with appropriate properties. (2) Characterize the FRUs according to their expected failure rate. (3) Observe the behavior of the FRUs and collect data about anomalous behavior. (4) Analyze the data about anomalous behavior, either online or offline, to locate faulty FRUs. (5) Replace the faulty FRU. (6) Restart the replaced FRU.

(5) Replace the Faulty FRU 39 The physical replacement of a faulty FRU is a problem of mechanical and electrical design. High-availability applications require replicated components to support the on-line replacement of FRUs.

Contribution of the TTA: Reconfiguration 40 Dynamic reconfiguration poses special challenges: failures of the reconfiguration system are of the utmost criticality, because a correct configuration can be destroyed. The TTA enables the implementation of a fault-tolerant (TMR) reconfiguration architecture. Downloading new software versions dynamically requires an appropriate secure infrastructure built on top of the TTA.

Six Steps to Take 41 (1) Identify and design the planned field-replaceable units (FRUs) with appropriate properties. (2) Characterize the FRUs according to their expected failure rate. (3) Observe the behavior of the FRUs and collect data about anomalous behavior. (4) Analyze the data about anomalous behavior, either online or offline, to locate faulty FRUs. (5) Replace the faulty FRU. (6) Restart the replaced FRU.

(6) Restart the Faulty FRU 42 Before a component can continue its service, the relevant ground state at the next reintegration instant must be loaded. Backward recovery to a previous state is not reasonable in a real-time environment.

Contribution of the TTA 43 The TTA requires that a periodic reintegration point (the ground cycle) be designed into the behavior of a component and published at the TII interface. The ground state (g-state) at the reintegration point can be captured periodically by a diagnostic component that is an independent FCU. A component can be reset and restarted by the diagnostic component with a restart message that contains an estimate of the relevant g-state at the next reintegration instant.
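
A minimal sketch of what such a restart message might contain; the layout and the 64-byte bound on the g-state are assumptions for illustration, while the notions of reintegration instant and g-state come from the slide above:

    #include <stdint.h>

    #define GSTATE_MAX 64   /* assumed upper bound on the size of the g-state */

    typedef struct {
        uint16_t component_id;           /* component to be reset and restarted  */
        uint64_t reintegration_instant;  /* global time of the next ground cycle */
        uint16_t gstate_len;             /* number of valid bytes in gstate[]    */
        uint8_t  gstate[GSTATE_MAX];     /* estimated g-state valid at that time */
    } restart_message_t;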

Contribution of the TTA 44 In a fault-tolerant configuration, the ground state is voted into the new component autonomously. In non-fault-tolerant configurations, the estimation of the ground state that is relevant at the next reintegration instant must be performed by the diagnostic component.

Conclusion 45 The TTA supports diagnosis and maintainability by: its clear component concept, which forms a strong basis for diagnosis; the non-intrusive observability of component behavior; the provision of error-containment boundaries that limit the propagation of errors from faulty components; and a sparse global time base that supports the causal analysis of events in a distributed application.