A Diversity of Duplications David Powell Special event «Dependability of computing systems, Memories and future» in honor of Jean-Claude Laprie LAAS-CNRS, Toulouse, 16 April 2010
Duplication: error detection, error tolerance
Outline Some memories on duplication Some recent and ongoing work on duplication
Some memories
1974-1976 Gordini First duplicated system built in LAAS Detection of HW faults Duplicated 8080 8-bit microprocessors 8 kbytes of parity-checked memory
1974-1976 Gordini
1974-1976 Bi-Gordini
1979 Hair! Gordini with people Gordini Jean-Claude Hiro Ihara
1979-1984 Armure A hot standby duplicated system developed for the French space agency in the context of the "SURF national project". The application was part of the ground segment of the Cospas-Sarsat international satellite-based search-and-rescue system.
1974-1976 Armure A guy that worked on the project
1986-1991 Delta-4 Pioneering work on duplication implemented by software in a CORBA-like environment: active replication, passive replication, semi-active replication
1987 Delta-4 A guy that didn't work on the project
Some recent and ongoing work on duplication
A Railway Duplication Context: duplication of fail-safe controllers (coded processors) in automatic subway systems Problem: replica consistency despite unreliable communication
Inter-section handover [Figure labels: train T1, Section A, Section B, block, lock, negative detectors, Controller A, Controller B.] Controller A unregisters trains leaving the lock. Controller B registers trains entering the lock and assigns a target: next station or lock, block behind previous train.
Duplication = danger! [Figure labels: trains T1 and T2, Section A, Section B, block, lock, negative detectors, Controller A (replicas A1, A2), Controller B (replicas B1, B2).] B1 registers T2 and assigns its target, then fails (while T2 proceeds).
Duplication = danger! [Figure labels: trains T1, T2 and T3, Section A, Section B, block, lock, negative detectors, Controller A (replicas A1, A2), Controller B (replicas B1, B2).] B2 registers T3 and assigns an incorrect target (since it missed T2).
Problem: consistency between duplicated units despite unreliable communication is provably impossible!
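The hazard shown on the two "Duplication = danger!" slides can be replayed as a toy model. This is a minimal sketch, not the actual controller logic: the `Replica` class, its `register` method, the `replay` helper and the rule "target = block behind the most recently registered train" are simplifying assumptions made for illustration.

```python
class Replica:
    """Hypothetical model of one replica (B1 or B2) of Controller B."""

    def __init__(self, name):
        self.name = name
        self.trains = []  # trains registered so far, in order of entry

    def register(self, train):
        """Register a train entering the lock and return its assigned target."""
        if self.trains:
            target = f"block behind {self.trains[-1]}"
        else:
            target = "next station"
        self.trains.append(train)
        return target


def replay(t2_update_reaches_b2):
    """Replay the handover scenario; return the target B2 assigns to T3."""
    b1, b2 = Replica("B1"), Replica("B2")
    for replica in (b1, b2):
        replica.register("T1")      # both replicas already know T1
    b1.register("T2")               # B1 registers T2 and assigns its target
    if t2_update_reaches_b2:        # the update to B2 may be lost in transit
        b2.register("T2")
    # B1 now fails while T2 proceeds; B2 takes over and registers T3.
    return b2.register("T3")


print(replay(t2_update_reaches_b2=True))
print(replay(t2_update_reaches_b2=False))
```

When the update is delivered, B2 places T3 behind T2. When it is lost, B2 places T3 behind T1, which is the incorrect target of the second slide.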
Solution: PADRE, a fail-safe multicast Protocol for Asynchronous Duplex REdundancy [State diagram. States: nominal duplex config. and simplex config. (nominal service); safe duplex config. and benign failure (safe); catastrophic failure (unsafe). Transition labels: fault of primary or secondary; fault of primary; fault of secondary; potential inconsistency (transmission error); state restoration; repair.]
Solution: PADRE, a fail-safe multicast Protocol for Asynchronous Duplex REdundancy. Deployed by Siemens Transportation Systems (previously Matra Transport) in New York (Canarsie line), Barcelona, Paris (line 3), Roissy. Soon in São Paulo (line 4), Paris (line 1), Budapest (lines 2 & 4), Helsinki, Algiers, New York (PATH line).
A Robotics Duplication Context: temporal planning for an autonomous robot Problem: insufficient or erroneous knowledge encoded in domain models
Software Architecture. Goals feed the Decisional Layer (decision making, planning); Executive Layer (decompose plan actions into elementary tasks, execution control of elementary tasks); Functional Layer (environment sensing, execution of elementary tasks).
How Do Robots Plan? Planning with IxTeT - planning in a plan space. [Figure labels: domain knowledge = declarative model (objects, actions, constraints) + heuristics; goals; current situation; initial partial plan; search engine; possible final plans; Executive Layer; Functional Layer.]
IxTeT Example
Problem: domain knowledge (models, heuristics) may be incomplete or wrong. Validation intrinsically difficult: can tolerance be envisaged? Multiplicity of valid but incomparable plans: what means can be used for detection?
Solution: FTplan [Figure labels: Goals; IxTeT with Model 1; IxTeT with Model 2; FTplan; Executive Layer; Functional Layer.] Detection before execution: temporal watchdog, plan analyzer. Detection during/after execution: online goal checker, action failure detection. Recovery: sequential planning, concurrent planning. Dala robot implementation.
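Recovery by sequential planning under a temporal watchdog can be sketched as follows. This is an illustration under stated assumptions, not FTplan's implementation and not the IxTeT API: `plan_with_recovery`, the planner callables (one per diversified model) and the `plan_ok` check standing in for a plan analyzer are all hypothetical names.

```python
import concurrent.futures as cf
import time


def plan_with_recovery(planners, goals, budget_s, plan_ok=lambda plan, goals: True):
    """Try each (name, planner) in turn; return the first acceptable (name, plan).

    A planner is abandoned if it exceeds budget_s seconds (temporal watchdog),
    raises an exception, or returns a plan rejected by plan_ok.
    Returns (None, None) if every planner fails.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(planners))
    try:
        for name, planner in planners:
            future = pool.submit(planner, goals)
            try:
                plan = future.result(timeout=budget_s)
            except cf.TimeoutError:
                continue  # watchdog expired: switch to the next model
            except Exception:
                continue  # planner crashed: switch to the next model
            if plan is not None and plan_ok(plan, goals):
                return name, plan
        return None, None
    finally:
        pool.shutdown(wait=False)  # do not wait for abandoned planners


# Hypothetical planners standing in for IxTeT with two diversified models.
def slow_model(goals):
    time.sleep(0.3)  # simulates a search that does not terminate in time
    return ["late plan"]


def good_model(goals):
    return [f"achieve {goal}" for goal in goals]


print(plan_with_recovery([("model 1", slow_model), ("model 2", good_model)],
                         ["take photo"], budget_s=0.05))
```

Concurrent planning would instead submit all planners at once and keep the first acceptable plan. The sequential form is shown because it needs the least coordination.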
Solution: FTplan. Prototype implementation (first diversification of declarative programs). Validated by fault injection (model mutation) on simulated Dala robot (first fault injection into declarative programs). 30-40% goal reliability improvement in presence of injected faults. Larger gains to be expected with a plan analyzer.
An Avionics Duplication Context: connection of a commercial laptop to a life-critical system (i.e., an aircraft) Problem: malicious intrusion into laptop's COTS operating system
Maintenance laptop Pilot Maintenance engineer Onboard equipment Flight logbook Maintenance terminal Paper manuals Electronic manuals
Maintenance laptop Pilot Maintenance engineer Onboard equipment Flight logbook Maintenance terminal Paper manuals Maintenance laptop
Connecting a laptop Flight management Aircraft management Aircraft information system "Off-board"
Connecting a laptop Flight management Aircraft management Aircraft information system? "Off-board"
Enabling technologies. Totel et al.'s "multi-level integrity" model: framework for multiple criticality levels in a single system; trusted computing base (TCB) for isolation and mediation; fault tolerance to allow data to flow from low to high. Platform virtualization techniques: isolation between virtual machines; attractive approach for implementing the TCB.
Solution: Virtual Duplication. ArSec «Architecture de Sécurités». [Figure labels: Model, View, Controller, Controller', VO, AspectJ, SWING, JVM, Safe VM, Hypervisor (XEN), Hardware; numbered steps 1 to 7, with 6' and 6"; outputs "to aircraft equipment" and "Error".]
Corruption attack. ArSec «Architecture de Sécurités». [Figure labels: Model, View, Controller (one marked "?!"), Controller', VO, AspectJ, SWING, JVM, Safe VM, Hypervisor (XEN), Hardware; numbered steps 1 to 7, with 6' and 6"; output "Error".]
Timing attack. ArSec «Architecture de Sécurités». [Figure labels: Model, View, Controller (one marked "?!"), Controller', VO, AspectJ, SWING, JVM, Safe VM, Hypervisor (XEN), Hardware; numbered steps 1 to 7, with 6' and 6"; output "Error".]
Reaction to attack. ArSec «Architecture de Sécurités». [Figure labels: Model, View, Controller, Controller', VO, AspectJ, SWING, JVM, Safe VM, Hypervisor (XEN), Hardware; steps 6 and 6' crossed out, one step marked "?!"; output "Error".] Reactions: reboot; change laptops; revert to maintenance terminal.
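The detection principle behind the corruption and timing attack slides, comparing the outputs of the two diversified replicas before anything reaches the aircraft, can be sketched in a few lines. This is a simplified illustration, not the ArSec code: the `validate` function, the `Verdict` names and the representation of a reply as an `(output, latency_s)` pair are assumptions.

```python
from enum import Enum


class Verdict(Enum):
    FORWARD = "forward to aircraft equipment"
    CORRUPTION = "outputs differ"
    TIMING = "output missing or late"


def validate(replies, deadline_s):
    """Compare the replies of two replicas.

    replies: two (output, latency_s) pairs, one per virtual machine.
             output is None (and latency_s is ignored) if nothing arrived.
    """
    (out_a, lat_a), (out_b, lat_b) = replies
    # A missing or late reply from either replica is treated as a timing error.
    if out_a is None or out_b is None or max(lat_a, lat_b) > deadline_s:
        return Verdict.TIMING
    # Both replies arrived in time: they must agree to be forwarded.
    if out_a != out_b:
        return Verdict.CORRUPTION
    return Verdict.FORWARD


print(validate([("cmd", 0.1), ("cmd", 0.2)], deadline_s=1.0))
print(validate([("cmd", 0.1), ("cmd-tampered", 0.2)], deadline_s=1.0))
print(validate([("cmd", 0.1), (None, None)], deadline_s=1.0))
```

Any verdict other than `FORWARD` would trigger one of the reactions listed on the slide (reboot, change laptops, revert to maintenance terminal).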
Summary
Name | Context | Objective | Problem | Solution
PADRE | Railways | Availability & Safety | Unreliable communication (bad diversity) | Fail-safe asynchronous multicast
FTplan | Robotics | Availability | Domain knowledge deficiencies | Diversified domain models (good diversity)
ArSec | Avionics | Security & Safety | Malicious intrusion | Virtualization & diversified OSs (good diversity)
The Future. ArSec «Architecture de Sécurités». Dealing with the dichotomy between: good diversity, which favors independent manifestation of design faults (including vulnerabilities), allowing their tolerance; and bad diversity, which causes non-deterministic behavior that gives rise to false positives. Research directions for dealing with bad diversity: constraints on internal operation of virtual machines (e.g., thread scheduling) without reducing good diversity; constraints on programmers (e.g., programming styles) without reducing ease-of-programming.
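How bad diversity produces a false positive can be shown with a small example. It is only an illustration of the phenomenon: the log lines and both comparison functions are hypothetical, and the order-insensitive comparison is not one of the research directions listed above. It is valid only when the order of outputs carries no meaning.

```python
def naive_equal(outputs_a, outputs_b):
    """Strict comparison: any difference, including ordering, is an error."""
    return outputs_a == outputs_b


def order_insensitive_equal(outputs_a, outputs_b):
    """Compare as multisets, ignoring the order in which outputs were produced."""
    return sorted(outputs_a) == sorted(outputs_b)


# Two correct replicas whose threads happened to be scheduled differently.
replica_1 = ["task A done", "task B done"]
replica_2 = ["task B done", "task A done"]

print(naive_equal(replica_1, replica_2))              # False: a false positive
print(order_insensitive_equal(replica_1, replica_2))  # True
```

A genuinely different output, such as `["task A done", "task X done"]`, is still rejected by both comparisons.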
Dependability: a Unifying Concept for Reliable Computing (FTCS-12)
A Diversity of Duplications "35 years of duplication without doing the same thing twice"