A Diversity of Duplications

A Diversity of Duplications David Powell Special event «Dependability of computing systems, Memories and future» in honor of Jean-Claude Laprie LAAS-CNRS, Toulouse, 16 April 2010

Duplication: error detection, error tolerance

Outline Some memories on duplication Some recent and ongoing work on duplication

Some memories

1974-1976 Gordini. First duplicated system built at LAAS. Detection of HW faults. Duplicated 8080 8-bit microprocessors. 8 kbytes of parity-checked memory.

1974-1976 Gordini

1974-1976 Bi-Gordini

1979 Hair! Gordini with people. [Photo: Gordini, Jean-Claude, Hiro Ihara]

1979-1984 Armure. A hot-standby duplicated system developed for the French space agency in the context of the "SURF national project". It was applied in the ground segment of the Cospas-Sarsat international satellite-based search-and-rescue system.

1974-1976 Armure A guy that worked on the project

1986-1991 Delta-4. Pioneering work on duplication implemented by software in a CORBA-like environment: active replication, passive replication, semi-active replication.

1987 Delta-4 A guy that didn't work on the project

Some recent and ongoing work on duplication

A Railway Duplication. Context: duplication of fail-safe controllers (coded processors) in automatic subway systems. Problem: replica consistency despite unreliable communication.

Inter-section handover. [Diagram: train T1 passing from Section A to Section B through a lock block fitted with negative detectors.] Controller A: unregisters trains leaving the lock. Controller B: registers trains entering the lock; assigns target: next station or lock, block behind previous train.

Duplication = danger! [Diagram: trains T1 and T2, Sections A and B, lock block with negative detectors; Controller A duplicated as A1 and A2, Controller B duplicated as B1 and B2.] B1 registers T2, assigns its target, then fails (while T2 proceeds).

Duplication = danger! [Diagram: trains T1, T2 and T3, Sections A and B, lock block with negative detectors; Controller A duplicated as A1 and A2, Controller B duplicated as B1 and B2.] B2 registers T3 and assigns an incorrect target (since it missed T2).
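The failure scenario on the last two slides can be replayed in a few lines. This is only a minimal sketch under stated assumptions: the `Replica` class, the `multicast` helper and the rule "target = stay behind the last registered train" are hypothetical simplifications introduced here, not the controllers' actual logic.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One replica of a section controller (hypothetical model)."""
    name: str
    alive: bool = True
    registered: list = field(default_factory=list)  # trains, oldest first

    def register(self, train: str) -> str:
        """Register a train and return its target: what it must stay behind."""
        target = self.registered[-1] if self.registered else "next station"
        self.registered.append(train)
        return target


def multicast(replicas, train, lost_by=()):
    """Deliver a registration to every live replica that does not lose the message."""
    targets = {}
    for replica in replicas:
        if replica.alive and replica.name not in lost_by:
            targets[replica.name] = replica.register(train)
    return targets


b1, b2 = Replica("B1"), Replica("B2")
multicast([b1, b2], "T1")                        # both replicas register T1
t2 = multicast([b1, b2], "T2", lost_by={"B2"})   # transmission error: B2 misses T2
b1.alive = False                                 # B1 fails while T2 proceeds
t3 = multicast([b1, b2], "T3")                   # only B2 is left to register T3

print(t2)  # {'B1': 'T1'}
print(t3)  # {'B2': 'T1'}
```

B2 places T3 behind T1 although T2 is actually between them, which is the incorrect target the slide warns about.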

Problem: consistency between duplicated units despite unreliable communication is provably impossible!

Solution: PADRE. Fail-safe multicast. Protocol for Asymmetric Duplex REdundancy. [State diagram. States: nominal duplex config., simplex config., safe duplex config., catastrophic failure. Transition labels: fault of primary or secondary; fault of primary; fault of secondary (benign failure); potential inconsistency (transmission error); state restoration; repair. Regions: nominal service, safe, unsafe.]

Solution: PADRE. Fail-safe multicast. Protocol for Asymmetric Duplex REdundancy. Deployed by Siemens Transportation Systems (previously Matra Transport) in New York (Canarsie line), Barcelona, Paris (line 3) and Roissy. Soon in São Paulo (line 4), Paris (line 1), Budapest (lines 2 & 4), Helsinki, Algiers and New York (PATH line).

A Robotics Duplication. Context: temporal planning for an autonomous robot. Problem: insufficient or erroneous knowledge encoded in domain models.

Software Architecture. Goals are given to the decisional layer. Decisional layer: decision making, planning. Executive layer: decompose plan actions into elementary tasks; execution control of elementary tasks. Functional layer: environment sensing; execution of elementary tasks.

How Do Robots Plan? Planning with IxTeT: planning in a plan space. [Diagram elements: domain knowledge (declarative model: objects, actions, constraints; heuristics); goals; current situation; initial partial plan; search engine; possible final plans; executive layer; functional layer.]

IxTeT Example

Problem. Domain knowledge (models, heuristics) may be incomplete or wrong. Validation is intrinsically difficult: can tolerance be envisaged? Multiplicity of valid but incomparable plans: what means can be used for detection?

Solution: FTplan. [Architecture diagram: goals; FTplan with two IxTeT planners, one using Model 1 and one using Model 2; executive layer; functional layer.] Detection before execution: temporal watchdog, plan analyzer. Detection during/after execution: online goal checker, action failure detection. Recovery: sequential planning, concurrent planning. Dala robot implementation.
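The sequential-planning recovery named on this slide might look roughly as follows. It is a hedged sketch, not FTplan itself: `ftplan_sequential`, the toy planners and the analyzer are hypothetical names, and the "watchdog" here only checks elapsed time after a planner returns instead of preempting it.

```python
import time


def ftplan_sequential(planners, goals, analyze, deadline_s=1.0):
    """Try each diversified planner in turn until one yields an acceptable plan.

    planners:   callables mapping goals to a plan (list of actions) or None.
    analyze:    callable plan -> bool, standing in for a plan analyzer.
    deadline_s: per-planner time budget, standing in for the temporal watchdog.
    """
    for index, planner in enumerate(planners):
        start = time.monotonic()
        plan = planner(goals)
        elapsed = time.monotonic() - start
        if plan is None or elapsed > deadline_s:
            continue  # no plan found, or too slow: switch to the next model
        if analyze(plan):
            return index, plan
    return None, None


def planner_model_1(goals):
    # Faulty model (assumed for illustration): omits the action for the last goal.
    return [f"do_{g}" for g in goals[:-1]]


def planner_model_2(goals):
    # Diversified model that covers every goal.
    return [f"do_{g}" for g in goals]


def make_analyzer(goals):
    # Accept a plan only if it contains one action per goal.
    return lambda plan: all(f"do_{g}" in plan for g in goals)


goals = ["photo", "comm"]
used, plan = ftplan_sequential(
    [planner_model_1, planner_model_2], goals, make_analyzer(goals)
)
print(used, plan)  # 1 ['do_photo', 'do_comm']
```

Concurrent planning would instead start both planners at once and keep the first acceptable plan; the sequential form is shown because it is the simpler of the two to read.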

Solution: FTplan. Prototype implementation: first diversification of declarative programs. Validated by fault injection (model mutation) on a simulated Dala robot: first fault injection into declarative programs. 30-40% goal reliability improvement in the presence of injected faults. Larger gains to be expected with a plan analyzer.

An Avionics Duplication. Context: connection of a commercial laptop to a life-critical system (i.e., an aircraft). Problem: malicious intrusion into the laptop's COTS operating system.

Maintenance laptop. [Diagram, shown in two steps: pilot, maintenance engineer, onboard equipment, flight logbook, maintenance terminal, paper manuals, electronic manuals; in the second step a maintenance laptop appears in place of the electronic manuals.]

Connecting a laptop. [Diagram, shown in two steps: flight management, aircraft management, aircraft information system, "off-board"; the second step adds a question mark at the aircraft information system.]

Enabling technologies. Totel et al.'s "multi-level integrity" model: framework for multiple criticality levels in a single system; trusted computing base for isolation and mediation; fault tolerance to allow data to flow from low to high. Platform virtualization techniques: isolation between virtual machines; attractive approach for implementing the TCB.
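The idea that fault tolerance lets data flow from low to high integrity can be illustrated with a small mediation function. This is a sketch under assumptions made here, not the model's definition: integrity levels are plain integers (higher means more trusted), and the `mediate` function and its `quorum` parameter are hypothetical.

```python
from collections import Counter


def mediate(values, src_level, dst_level, quorum=2):
    """Decide whether data may flow from src_level to dst_level.

    values: the same datum as produced by one or more redundant sources.
    Downward or same-level flows pass unchanged. Upward flows are accepted
    only if at least `quorum` redundant sources agree on the value.
    """
    if src_level >= dst_level:
        return values[0]
    value, votes = Counter(values).most_common(1)[0]
    if votes >= quorum:
        return value
    raise PermissionError("upward flow rejected: redundant sources disagree")


print(mediate(["status ok"], src_level=3, dst_level=1))         # status ok
print(mediate(["set 42", "set 42"], src_level=1, dst_level=3))  # set 42
try:
    mediate(["set 42", "set 99"], src_level=1, dst_level=3)
except PermissionError as err:
    print(err)  # upward flow rejected: redundant sources disagree
```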

Solution: Virtual Duplication. ArSec «Architecture de Sécurités». [Architecture diagram. Elements shown: hardware; XEN hypervisor; virtual machines each running a JVM with SWING and AspectJ; a safe VM; Model, View, Controller and Controller' components; VO; an error signal; an output to aircraft equipment. Interactions are numbered 1 to 7, with 6' and 6" as variants of step 6.]

Corruption attack. ArSec «Architecture de Sécurités». [Architecture diagram of the virtual duplication solution (hardware, XEN hypervisor, JVMs with SWING and AspectJ, safe VM, Model/View/Controller components, VO, error signal, numbered interactions), with "?!" marking the attacked component.]

Timing attack. ArSec «Architecture de Sécurités». [Architecture diagram of the virtual duplication solution (hardware, XEN hypervisor, JVMs with SWING and AspectJ, safe VM, Model/View/Controller components, VO, error signal, numbered interactions), with "?!" marking the attacked component.]

Reaction to attack. ArSec «Architecture de Sécurités». [Architecture diagram of the virtual duplication solution, with the affected elements crossed out.] Reactions: reboot; change laptops; revert to maintenance terminal.
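The detection and reaction shown on these slides can be approximated as a comparison of the two replicas' outputs, in value (corruption attack) and in time (timing attack). The sketch below is illustrative only: `compare_and_forward`, the 0.5 s skew bound and the escalation order of the three reactions are assumptions made here, not details taken from ArSec.

```python
from enum import Enum


class Reaction(Enum):
    REBOOT = "reboot"
    CHANGE_LAPTOPS = "change laptops"
    MAINTENANCE_TERMINAL = "revert to maintenance terminal"


# Assumed escalation order: the slide lists the reactions but not a policy.
ESCALATION = [Reaction.REBOOT, Reaction.CHANGE_LAPTOPS, Reaction.MAINTENANCE_TERMINAL]


def compare_and_forward(output_a, output_b, skew_s, max_skew_s=0.5):
    """Forward a command only if both replicas agree in value and in time.

    Returns (command, None) on agreement, or (None, reason) on a detected error.
    """
    if output_a != output_b:
        return None, "corruption suspected"
    if skew_s > max_skew_s:
        return None, "timing anomaly suspected"
    return output_a, None


def react(previous_errors):
    """Pick the next reaction given how many errors were already handled."""
    return ESCALATION[min(previous_errors, len(ESCALATION) - 1)]


print(compare_and_forward("run test 7", "run test 7", skew_s=0.1))
print(compare_and_forward("run test 7", "erase log", skew_s=0.1))
print(compare_and_forward("run test 7", "run test 7", skew_s=3.0))
print(react(0).value, "->", react(1).value, "->", react(2).value)
```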

Summary
PADRE. Context: railways. Objective: availability & safety. Problem: unreliable communication (bad diversity). Solution: fail-safe asynchronous multicast.
FTplan. Context: robotics. Objective: availability. Problem: domain knowledge deficiencies. Solution: diversified domain models (good diversity).
ArSec. Context: avionics. Objective: security & safety. Problem: malicious intrusion. Solution: virtualization & diversified OSs (good diversity).

The Future. ArSec «Architecture de Sécurités». Dealing with the dichotomy between good diversity, which favors independent manifestation of design faults (including vulnerabilities), allowing their tolerance, and bad diversity, which causes non-deterministic behavior that gives rise to false positives. Research directions for dealing with bad diversity: constraints on internal operation of virtual machines (e.g., thread scheduling) without reducing good diversity; constraints on programmers (e.g., programming styles) without reducing ease of programming.
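A toy example of how bad diversity produces a false positive, and how a scheduling constraint removes it. This is a sketch under assumptions: thread interleavings are modelled as plain index lists rather than real threads, and `run_replica` with its `constrained` flag is a hypothetical stand-in for a constraint on VM-internal scheduling.

```python
def run_replica(events, schedule, constrained=False):
    """Return the events in the order a replica emits them.

    schedule:    the interleaving chosen by this replica's thread scheduler.
    constrained: if True, impose one fixed order (a stand-in for a
                 scheduling constraint applied inside every virtual machine).
    """
    order = sorted(schedule) if constrained else schedule
    return [events[i] for i in order]


events = ["read sensor", "update view", "send command"]
schedule_vm1 = [0, 1, 2]
schedule_vm2 = [1, 0, 2]  # same work, different interleaving, no attack

free_1 = run_replica(events, schedule_vm1)
free_2 = run_replica(events, schedule_vm2)
print(free_1 != free_2)   # True: the comparator raises a false positive

fixed_1 = run_replica(events, schedule_vm1, constrained=True)
fixed_2 = run_replica(events, schedule_vm2, constrained=True)
print(fixed_1 != fixed_2) # False: the replicas now agree
```

The constraint touches only the ordering of events, not which operating system or JVM each replica runs, so the good diversity between the replicas is left untouched.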

Dependability: a Unifying Concept for Reliable Computing (FTCS-12)

A Diversity of Duplications "35 years of duplication without doing the same thing twice"