Stable Embedded Software Systems

Building Stable Embedded Software Systems
Lui Sha, lrs@cs.uiuc.edu, Feb 2006

The challenges of building large systems
FAA's major modernization project, the Advanced Automation System (AAS), was originally estimated to cost $2.5 billion with a completion date of 1996. In 1994, FAA cancelled the AAS program, casting aside 11 years of development time and, according to GAO, wasting more than $1.5 billion of taxpayer money. http://www.asiaweek.com/asiaweek/98/0717/nat_6_clk.html
According to a study by IBM, in a typical commercial development organization, debugging, testing, and verification activities can easily range from 50 to 75 percent of the total development cost. http://www.research.ibm.com/journal/sj/411/hailpern.html

Unexpected interactions
- Implicit and inconsistent assumptions and abstractions: incompatible assumptions of HW and SW regarding the operation of the legs led to the loss of the Mars Polar Lander.
- Incompatible cross-domain protocols: a pathological interaction between the real-time and synchronization protocols on Pathfinder caused repeated resets and nearly doomed the mission.

System Instabilities
Operationally, an unstable system is one that would allow a fault in a non-critical component to cascade into system failure.
For example, on June 4, 1996, about 40 seconds after initiation of the flight sequence, at an altitude of about 3700 m, Ariane 5 veered from its flight path, broke up, and exploded. The most astonishing result of the investigation is that the root cause lay within a reused Ariane 4 software component that Ariane 5 did not require [1].
[1] Http://www.rvs.uni-bielefeld.de/publications/In

Too Close for Comfort
Recently, emergency AD 2005-18-51 was issued on August 29, 2005. FAA explains as follows:
We received a recent report of a significant noseup pitch event on a Boeing Model 777-200 series airplane while climbing through 36,000 feet altitude. The flight crew disconnected the autopilot and stabilized the airplane, during which time the airplane climbed above 41,000 feet, decelerated to a minimum speed of 158 knots, and activated the stick shaker. We have evaluated all pertinent information and identified an unsafe condition that is likely to exist or develop on other Boeing Model 777 airplanes of this same type design. These anomalies could result in high pilot workload, deviation from the intended flight path, and possible loss of control of the airplane.

How to build a reliable service?
There are two parties of thought:
- Fault avoidance party: put all the eggs in a bullet-proof basket.
- Fault tolerance party: use diversity, e.g., N-version programming.
Which party will you vote for?

Complexity, diversity and reliability
To build a robust software system that can tolerate software faults, we must understand the relations between:
- Complexity: the root cause of software faults.
- Diversity: a necessary condition for software fault tolerance.
- Reliability: a function of complexity and diversity.
We shall begin with postulates based on self-evident facts.

Software development postulates
We assert that the following postulates are self-evident:
- P1: Complexity Breeds Bugs. Everything else being equal, the more complex the software project is, the harder it is to make it reliable.
- P2: All Bugs are Not Equal. You fix a bunch of obvious bugs quickly, but finding and fixing the last few bugs is much harder.
- P3: All Budgets are Finite. There is only a finite amount of effort (budget) that we can spend on any project.
How can we model software complexity?

Logical complexity
- Computational complexity => the number of steps in computation.
- Logical complexity => the number of steps in verification.
- A program can have different logical and computational complexities. Bubble sort has lower logical complexity but higher computational complexity; heap sort is the other way around.
- Residual logical complexity: a program could have high logical complexity initially. However, if it has been verified and can be used as is, then its residual complexity is zero.

The implications
- P1: Complexity Breeds Bugs. For a given mission duration t, the reliability of software decreases as complexity increases.
- P2: All Bugs are Not Equal. For a given degree of complexity, the reliability function has a monotonically decreasing rate of improvement with respect to development effort.
- P3: Budgets are Finite. Diversity is not free. That is, if we go for n-version diversity, we must divide the available effort n ways.
One simple model that satisfies P1, P2 and P3:
- Sum of efforts used in diversity = available effort
- Reliability function: $R(t) = e^{-k \, (\mathrm{complexity} / \mathrm{effort}) \, t}$
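The model R(t) = exp(-k * (complexity / effort) * t) can be explored numerically. The following is a minimal sketch, not the author's analysis. It assumes k = 1 and unit mission time by default, and it assumes (this is not stated on the slide) that 3-version programming uses 2-out-of-3 majority voting over independently failing versions, each built with one third of the effort. All function names are hypothetical.

```python
import math


def reliability(complexity, effort, t=1.0, k=1.0):
    """Slide model: R(t) = exp(-k * (complexity / effort) * t)."""
    return math.exp(-k * (complexity / effort) * t)


def majority_of_three(r):
    """Reliability of 2-out-of-3 voting over independent versions.

    Assumption for illustration: the system works when at least two of
    the three versions work, and failures are independent.
    """
    return r ** 3 + 3 * r ** 2 * (1 - r)


def compare(complexity=1.0, total_effort=1.0, t=1.0, k=1.0):
    """Return (one_version, three_version) reliability for the same budget."""
    # One version receives the whole budget.
    one_version = reliability(complexity, total_effort, t, k)
    # P3: with three versions, the budget is divided three ways.
    per_version = reliability(complexity, total_effort / 3, t, k)
    three_version = majority_of_three(per_version)
    return one_version, three_version


if __name__ == "__main__":
    for effort in (1.0, 20.0, 100.0):
        one, three = compare(total_effort=effort)
        print(f"effort={effort:6.1f}  1-version={one:.4f}  3-version={three:.4f}")
```

Under these assumptions, dividing a small budget three ways hurts, while with a generous budget the voted system comes out ahead, which is one way to see why diversity alone does not settle the question.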

Diversity, complexity and reliability
[Figure comparing 3-version programming, 1-version programming, and a reliable core with 10x complexity reduction.]
Analysis shows that what really counts is not the degree of diversity. Rather, it is the existence of a simple and reliable core that can guarantee the stability of the system. This result is also robust against changes in the model assumptions.
Source: Using Simplicity to Control Complexity, IEEE Software 7/8, 2001, L. Sha

On stability
In the foreseeable future, we can only build a small number of modest-size defect-free components, and only at great expense. To plan otherwise is overly optimistic at best.
We need to learn to build structurally stable software systems with:
- A small number of defect-free components
- A modest number of nearly defect-free components
- A majority of COTS-quality components with residual bugs

When You Can't Keep it Simple
Conceptually, to ensure the stability of a software system, we need to:
1. Separate requirements into different criticality levels.
2. Allocate requirements with different criticality levels to different components.
3. Ensure that critical components can only USE but not DEPEND on the services of non-critical components.
4. Ensure that critical components are simple enough that we can build them reliably.
But it is hard to keep things simple in practice because of the features and performance that we want. A solution to the reliability vs. performance dilemma is to use analytically redundant components that allow us to use simplicity to control complexity.
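The USE-but-not-DEPEND rule in step 3 can be illustrated with a small wrapper. This is a hedged sketch under assumed names (`use_optional_service`, `within_limits`, `safe_setpoint` and `make_optimizer` are all hypothetical): the critical component accepts the non-critical result only when it passes a simple check, and otherwise falls back to its own simple computation. The sketch does not cover timing faults such as a service that never returns.

```python
def use_optional_service(service, fallback, is_acceptable):
    """Call a non-critical service without depending on it.

    The result is used only if the call succeeds and passes the
    acceptance check; otherwise the simple fallback is used.
    """
    try:
        result = service()
    except Exception:
        # A crash in the non-critical service must not propagate.
        return fallback()
    return result if is_acceptable(result) else fallback()


def safe_setpoint():
    """Simple, verifiable default owned by the critical component."""
    return 0.0


def within_limits(value):
    """Acceptance check: a number inside the assumed range [-1, 1]."""
    return isinstance(value, (int, float)) and -1.0 <= value <= 1.0


def make_optimizer(value=None, fail=False):
    """Build a stand-in for a complex, possibly faulty optimizer."""
    def optimizer():
        if fail:
            raise RuntimeError("optimizer crashed")
        return value
    return optimizer


if __name__ == "__main__":
    for service in (make_optimizer(0.5), make_optimizer(7.0), make_optimizer(fail=True)):
        print(use_optional_service(service, safe_setpoint, within_limits))
```

The design choice is that correctness rests only on `within_limits` and `safe_setpoint`, both of which are small enough to verify; the optimizer can only improve the result, never break it.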

Some Questions
- What is the definition of stability in a software system?
- How do we develop analytically redundant components and safely use unreliable services?
- How can analytic redundancy help solve the infamous state explosion problem?
- What is the domain of convergence in software stability control?
- How can we analyze the structural stability of a software system?
We shall illustrate these ideas with a simple example.

An example
Once upon a time, there was an exam on sorting programs. Grades were given as follows:
- A: Correct and fast: n log(n) in the worst case
- B: Correct but slow
- F: Incorrect
Joe can verify his bubble sort, but has only a 50% chance of writing heap sort correctly. What is his optimal strategy?

Stability of a software system
Often, requirements can be decomposed into:
Critical (correctness) requirements
- Sorting: output numbers in correct order
- TSP: visit every city exactly once
- Control: stable and controllable
Performance optimization
- Sorting: faster
- TSP: shorter path
- Control: less time/error/energy
[Figure labels: Heap Sort, Bubble Sort]
Bounded responses to errors: a stable software system is one that can maintain key properties in spite of errors in non-critical components.

Stability control
What if the untrusted sorting program alters an item in the input list?
1. Create a verified simple primitive called permute.
2. The untrusted sorting software is not allowed to touch the input list except through the permute primitive.
3. Enforce the restriction using an object whose only method is permute.
Under stability control, the untrusted heap sort can only produce out-of-order application errors.
- The domain of convergence in software error control is the set of states that satisfy the precondition of the recovery procedure.
- Stability control is the mechanism used to ensure that the preconditions will hold.
- State explosion in a stability-controlled component is a non-problem.
A stable system allows for SAFE TESTING of NEW COMPONENTS.
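The permute idea can be sketched in a few lines. This is an illustration under stated assumptions rather than the author's implementation: the class and function names are hypothetical, a read accessor `get` is added so the untrusted code can compare items, and Python's underscore convention only signals the restriction (a real system would enforce it with stronger isolation). Timing faults, such as an untrusted sort that never terminates, are out of scope here.

```python
class PermuteOnlyList:
    """Holds the input; untrusted code may only read items and swap them."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def get(self, i):
        return self._items[i]

    def permute(self, i, j):
        # Swapping two positions can never add, drop, or alter an item.
        self._items[i], self._items[j] = self._items[j], self._items[i]

    def snapshot(self):
        return list(self._items)


def untrusted_heap_sort(view):
    """Heap sort written only in terms of get() and permute()."""
    n = len(view)

    def sift_down(root, end):
        while True:
            child = 2 * root + 1
            if child >= end:
                return
            if child + 1 < end and view.get(child) < view.get(child + 1):
                child += 1
            if view.get(root) >= view.get(child):
                return
            view.permute(root, child)
            root = child

    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n)
    for end in range(n - 1, 0, -1):
        view.permute(0, end)
        sift_down(0, end)


def trusted_bubble_sort(view):
    """Simple, verifiable recovery procedure."""
    for end in range(len(view) - 1, 0, -1):
        for i in range(end):
            if view.get(i) > view.get(i + 1):
                view.permute(i, i + 1)


def is_sorted(view):
    return all(view.get(i) <= view.get(i + 1) for i in range(len(view) - 1))


def stable_sort(items, untrusted_sort):
    """Try the fast untrusted sort; recover with bubble sort if needed."""
    view = PermuteOnlyList(items)
    try:
        untrusted_sort(view)
    except Exception:
        pass  # whatever order is left is still a permutation of the input
    if not is_sorted(view):
        trusted_bubble_sort(view)
    return view.snapshot()
```

Because every reachable state is a permutation of the input, the precondition of the bubble-sort recovery always holds, so the checker only has to test ordering. This also suggests Joe's strategy: submit the verified bubble sort as the fallback behind the heap sort, so the worst grade is B.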

Stability control for control software
[Figure: Telelab screen shot. Labels: LynxOS; Simplex; A/V streams; Win98/NT clients; annotated, pre-recorded presentation (e.g. HTML), used in case of communication failures.]
http://www-rtsl.cs.uiuc.edu/ (click project, click drii, click telelab download)

Transform the DEPEND relation into a USE relation
Having a reliable controller, we identify the recovery region within which the controller can operate successfully.
- The recovery region is a subset of the states that are admissible with respect to the operational constraints.
- The largest recovery region can be found using linear matrix inequalities (LMI).
- This approach is applicable to any linearizable system, which covers most practical control systems.

System dynamics: $\dot{X} = AX$
Stability condition: $A^T Q + Q A < 0$
Largest recovery region: $\min \log \det Q$ subject to the operational constraints $C^T X < 1$
Safety switching rule: $X^T Q X < 1$

[Figure labels: operational constraints, stability envelope, recovery region.]
The system under the new complex controller must stay within the recovery region.

Simplex Architecture for Control
[Data flow block diagram. Blocks: trusted simple and reliable controller; online upgradeable complex controller; stability monitoring ($X^T Q X < 1$); plant.]
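The switching rule x^T Q x < 1 can be sketched as a small decision function. This is a simplified illustration with hypothetical names, assuming Q has already been computed offline and is given as a list of rows. It checks only the current state; a practical decision module would likely also need to account for sampling delay and for when control is handed back to the complex controller, which the slides do not detail.

```python
def quadratic_form(x, Q):
    """Return x^T Q x for a state vector x and a square matrix Q."""
    n = len(x)
    return sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))


def in_recovery_region(x, Q):
    """Safety switching rule: the state is recoverable if x^T Q x < 1."""
    return quadratic_form(x, Q) < 1.0


def select_control(x, Q, complex_controller, safe_controller):
    """Use the complex controller only while the state is recoverable."""
    if in_recovery_region(x, Q):
        return complex_controller(x)
    # Outside the recovery region: hand control to the trusted controller.
    return safe_controller(x)


if __name__ == "__main__":
    # Hypothetical 2-state example with Q chosen as the identity matrix,
    # so the recovery region is the open unit disc.
    Q = [[1.0, 0.0], [0.0, 1.0]]
    complex_controller = lambda x: ("complex", -2.0 * x[0])
    safe_controller = lambda x: ("safe", -0.5 * x[0])
    print(select_control([0.5, 0.5], Q, complex_controller, safe_controller))
    print(select_control([1.0, 1.0], Q, complex_controller, safe_controller))
```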

The Inescapable Conclusion
The complexity of software has long passed the point at which we can produce 100% defect-free software. Denying this is naïve at best. However, our society increasingly relies upon software whose complexity is ever increasing, and it is unacceptable to let a minor error cascade and bring down a major system.
The inescapable conclusion is that we must develop the scientific foundation for engineering stable software systems: systems that are not completely error free but can reliably deliver essential services in spite of residual errors.
- All features are not equal. Some are safety critical, some mission critical, some useful, and some of questionable value.
- The key is to have a reliable core and well-formed dependencies. A critical component may USE but not DEPEND on less critical services.

Reasons to be Optimistic
- The United States of America is a highly stable and evolvable system. It has grown and made truly remarkable progress by the metric of civilization, even though many problems remain. But its basic components, human beings, are complex, error prone, and hard to test or verify.
- There are thousands of residual bugs in the telecom network, and yet it remains highly reliable.
- There are perhaps millions of bugs in the World Wide Web system of systems, but it is remarkably stable.
Complex but stable systems are uncommon, but they can be and have been built.

Appendix

Sources of difficulties
- Unexpected interactions resulting from incompatible abstractions, incorrect or implicit assumptions in system interfaces, and incompatible real-time, fault tolerance, and security protocols.
- Inadequate development infrastructure, as reflected in the lack of domain-specific reference architectures, tools, and design patterns with known and parameterized real-time, robustness, and security properties.
- System instabilities that result when faults and failures in one component cascade along complex and unexpected dependency graphs, resulting in catastrophic failures in a large part of a system or even the entire system.

Not Isolated Incidents
These are not isolated incidents. Rather, accidents and developmental problems are the manifestation of building modern avionics systems with a complexity higher than what can be handled by existing technological infrastructure.
The Standish group reported that a staggering 31.1% of projects will be canceled before they ever get completed. Further results indicate 52.7% of projects will cost 189% of their original estimates. The cost of these failures and overruns are just the tip of the proverbial iceberg. [2]
[1] http://www.gao.gov/new.items/d04393.pdf
[2] http://www.cs.nmt.edu/~cs328/reading/standish.pdf

Stable Systems
- In most applications, all features are not equal: some are critical, some are important, some are useful, and some are superfluous. Given existing technologies, industry can only afford to make critical features highly reliable.
- Complex and unknown dependency relations are a key contributor to software system instability. That is, a seemingly minor fault in a non-critical service can cascade along dependency chains and bring down the whole system.
- A stable software system is one that guarantees critical system properties and allows safe exploitation of imperfect but useful components.