Building Stable Embedded Software Systems
Lui Sha, lrs@cs.uiuc.edu
February 2006
The challenges of building large systems

The FAA's major modernization project, the Advanced Automation System (AAS), was originally estimated to cost $2.5 billion, with a completion date of 1996. In 1994, the FAA cancelled the AAS program, casting aside 11 years of development time and, according to the GAO, wasting more than $1.5 billion of taxpayer money. (http://www.asiaweek.com/asiaweek/98/0717/nat_6_clk.html)

According to a study by IBM, in a typical commercial development organization, debugging, testing, and verification activities can easily consume 50 to 75 percent of the total development cost. (http://www.research.ibm.com/journal/sj/411/hailpern.html)
Unexpected interactions

- Implicit and inconsistent assumptions and abstractions
- Incompatible cross-domain protocols
- Incompatible assumptions between hardware and software regarding the operation of the landing legs led to the loss of the Mars Polar Lander.
- A pathological interaction between real-time scheduling and synchronization protocols on Mars Pathfinder caused repeated resets and nearly doomed the mission.
System Instabilities

Operationally, an unstable system is one that allows a fault in a non-critical component to cascade into system failure. For example, on June 4, 1996, about 40 seconds after initiation of the flight sequence, at an altitude of about 3700 m, Ariane 5 veered from its flight path, broke up, and exploded. The most astonishing result of the investigation is that the root cause lay in a reused Ariane 4 software component that was not even required by Ariane 5 [1].

[1] http://www.rvs.uni-bielefeld.de/publications/In
Too Close for Comfort

Emergency AD 2005-18-51 was issued on August 29, 2005. The FAA explains as follows: we received a recent report of a significant nose-up pitch event on a Boeing Model 777-200 series airplane while climbing through 36,000 feet altitude. The flight crew disconnected the autopilot and stabilized the airplane, during which time the airplane climbed above 41,000 feet, decelerated to a minimum speed of 158 knots, and activated the stick shaker. We have evaluated all pertinent information and identified an unsafe condition that is likely to exist or develop on other Boeing Model 777 airplanes of this same type design. These anomalies could result in high pilot workload, deviation from the intended flight path, and possible loss of control of the airplane.
How to build a reliable service?

There are two parties of thought:
- The fault-avoidance party: put all the eggs in one bullet-proof basket.
- The fault-tolerance party: use diversity, e.g., N-version programming.

Which party will you vote for?
Complexity, diversity, and reliability

To build a robust software system that can tolerate software faults, we must understand the relations among
- Complexity: the root cause of software faults
- Diversity: a necessary condition for software fault tolerance
- Reliability: a function of complexity and diversity

We shall begin with postulates based on self-evident facts.
Software development postulates

We assert that the following postulates are self-evident:
- P1: Complexity Breeds Bugs. Everything else being equal, the more complex a software project is, the harder it is to make it reliable.
- P2: All Bugs are Not Equal. You fix a bunch of obvious bugs quickly, but finding and fixing the last few bugs is much harder.
- P3: All Budgets are Finite. There is only a finite amount of effort (budget) that we can spend on any project.

How can we model software complexity?
Logical complexity

- Computational complexity: the number of steps in a computation.
- Logical complexity: the number of steps in a verification.

A program can have different logical and computational complexities. Bubble sort has lower logical complexity but higher computational complexity; heap sort is the other way around.

Residual logical complexity: a program could have high logical complexity initially. However, if it has been verified and can be used as is, then its residual complexity is zero.
The implications

- P1 (Complexity Breeds Bugs): for a given mission duration t, the reliability of software decreases as complexity increases.
- P2 (All Bugs are Not Equal): for a given degree of complexity, the reliability function has a monotonically decreasing rate of improvement with respect to development effort.
- P3 (Budgets are Finite): diversity is not free. That is, if we go for n-version diversity, we must divide the available effort n ways.

One simple model that satisfies P1, P2, and P3:
- The sum of the efforts spent on the diverse versions equals the available effort.
- Reliability function: R(t) = e^(-k (complexity / effort) t)
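This model can be evaluated numerically. The sketch below is illustrative only: the constant k and the parameter values are arbitrary choices, not from the source. With majority voting, a 3-version system works when at least two of the three versions work, but by P3 each version gets only a third of the effort; a simple core with a 10x complexity reduction gets the full effort.

```python
import math

def version_reliability(complexity, effort, t, k=1.0):
    """R(t) = e^(-k * (complexity / effort) * t), per postulates P1-P3."""
    return math.exp(-k * (complexity / effort) * t)

def majority_3_version(complexity, total_effort, t, k=1.0):
    """3-version programming with majority voting: effort is split three
    ways (P3); the system works iff at least 2 of 3 versions work,
    assuming independent failures."""
    r = version_reliability(complexity, total_effort / 3, t, k)
    return 3 * r**2 * (1 - r) + r**3

# Illustrative parameters (arbitrary units).
C, E, t = 1.0, 1.0, 0.1
single = version_reliability(C, E, t)     # one full-effort version
triple = majority_3_version(C, E, t)      # three diverse versions
core = version_reliability(C / 10, E, t)  # simple core, 10x less complex
```

For these parameters, the single full-effort version beats naive 3-version diversity, and the simple reliable core beats both.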
Diversity, complexity, and reliability

Comparing 3-version programming, 1-version programming, and a reliable core with a 10x complexity reduction, the analysis shows that what really counts is not the degree of diversity. Rather, it is the existence of a simple and reliable core that can guarantee the stability of the system. This result is also robust against changes in the model's assumptions.

--- Using Simplicity to Control Complexity, IEEE Software, July/August 2001, L. Sha
On stability

In the foreseeable future, we can only build a small number of modest-size defect-free components, at great expense. To plan otherwise is overly optimistic at best. We need to learn to build structurally stable software systems with
- a small number of defect-free components,
- a modest number of nearly defect-free components, and
- a majority of COTS-quality components with residual bugs.
When You Can't Keep It Simple

Conceptually, to ensure the stability of a software system, we need to
1. separate requirements into different criticality levels;
2. allocate requirements with different criticality levels to different components;
3. ensure that critical components can only USE, but not DEPEND on, the services of non-critical components;
4. ensure that critical components are simple enough that we can build them reliably.

But it is hard to keep things simple in practice, because of the features and performance that we want. A solution to the reliability-versus-performance dilemma is to use analytically redundant components, which allow us to use simplicity to control complexity.
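The USE-but-not-DEPEND rule in step 3 can be sketched in code. This is a minimal Python illustration, not from the source; the function names and the timeout value are made up. The critical component asks a non-critical service for an answer, but bounds the wait and validates the result, so a hang, crash, or garbage reply in the non-critical service cannot propagate into the critical path.

```python
from concurrent.futures import ThreadPoolExecutor

# One worker thread isolates the non-critical call from the critical path.
_pool = ThreadPoolExecutor(max_workers=1)

def critical_task(noncritical_service, fallback, timeout_s=0.05):
    """USE the non-critical service, but do not DEPEND on it: bound the
    wait and validate the answer, falling back on any failure."""
    future = _pool.submit(noncritical_service)
    try:
        hint = future.result(timeout=timeout_s)  # bounded wait
    except Exception:                            # timeout or crash
        return fallback
    # Validate before use: garbage output degrades to the fallback.
    return hint if isinstance(hint, int) else fallback
```

The critical component's correctness is now independent of the non-critical service's behavior; the service can only improve the result, never break it.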
Some Questions

- What is the definition of stability in a software system?
- How do we develop analytically redundant components and safely use unreliable services?
- How can analytic redundancy help solve the infamous state explosion problem?
- What is the domain of convergence in software stability control?
- How can we analyze the structural stability of a software system?

We shall illustrate these ideas with a simple example.
An example

Once upon a time, there was an exam on sorting programs. Grades were given as follows:
- A: correct and fast: O(n log n) in the worst case
- B: correct but slow
- F: incorrect

Joe can verify his bubble sort, but has only a 50% chance of writing heap sort correctly. What is his optimal strategy?
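Joe's optimal strategy can be sketched as follows (a hypothetical Python illustration; the helper names are made up): submit the fast but unverified sort wrapped in a cheap acceptance test, falling back to the verified bubble sort if the test fails. This guarantees at least a B, with a 50% chance of an A.

```python
from collections import Counter

def bubble_sort(xs):
    """The verified, simple, slow component: the trusted fallback."""
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

def output_is_acceptable(out, original):
    """Cheap O(n) acceptance test: sorted order and same multiset."""
    return (Counter(out) == Counter(original) and
            all(out[i] <= out[i + 1] for i in range(len(out) - 1)))

def safe_sort(xs, fast_unverified_sort):
    """Try the fast sorter; on any error or bad output, use the slow
    verified one. The result is always correct."""
    try:
        out = fast_unverified_sort(list(xs))
    except Exception:
        return bubble_sort(xs)
    return out if output_is_acceptable(out, xs) else bubble_sort(xs)
```

A buggy heap sort can now produce, at worst, a slow but correct answer.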
Stability of a software system

Often, requirements can be decomposed into
- Critical (correctness) requirements
  - Sorting: output the numbers in the correct order
  - TSP: visit every city exactly once
  - Control: stable and controllable
- Performance optimizations
  - Sorting: sort faster (e.g., heap sort vs. bubble sort)
  - TSP: find a shorter path
  - Control: less time/error/energy

Bounded responses to errors: a stable software system is one that can maintain key properties in spite of errors in non-critical components.
Stability control

What if the untrusted sorting program alters an item in the input list?
1. Create a verified simple primitive called permute.
2. The untrusted sorting software is not allowed to touch the input list except through the permute primitive.
3. Enforce the restriction using an object whose only mutating method is permute.

Under stability control, the untrusted heap sort can only produce out-of-order application errors. The domain of convergence in software error control is the set of states that satisfy the precondition of the recovery procedure; stability control is the mechanism used to ensure that the precondition will hold. State explosion in a stability-controlled component is a non-problem. A stable system allows SAFE TESTING of NEW COMPONENTS.
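The permute primitive can be sketched as follows (a minimal Python illustration; the class and method names are made up). The untrusted sorter sees only read access and the swap primitive, so no sequence of calls can alter the multiset of items: the worst it can do is leave them out of order, which the cheap acceptance test catches.

```python
class PermuteOnly:
    """Wraps a list so untrusted code can reorder, but never alter, it."""

    def __init__(self, items):
        self._items = list(items)  # private copy; no other mutator exists

    def __len__(self):
        return len(self._items)

    def get(self, i):
        """Read-only access, for comparisons."""
        return self._items[i]

    def permute(self, i, j):
        """The only verified mutating primitive: swap two positions."""
        self._items[i], self._items[j] = self._items[j], self._items[i]

    def result(self):
        return list(self._items)

def untrusted_sort(p):
    """Untrusted code: may be buggy, but can only call get/permute."""
    for i in range(len(p)):
        for j in range(len(p) - 1 - i):
            if p.get(j) > p.get(j + 1):
                p.permute(j, j + 1)
```

Even if untrusted_sort were buggy, the multiset of items in the wrapper is an invariant; only their order can be wrong.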
Stability control for control software

[Telelab screen shot: a Simplex-based distributed A/V streaming demo. A LynxOS server streams audio/video to Win98/NT clients; an annotated, pre-recorded presentation (e.g., HTML) serves as the simple fallback in case of communication failures.]

http://www-rtsl.cs.uiuc.edu/ (click project, click drii, click telelab download)
Transform DEPEND relations into USE relations

Having a reliable controller, we identify the recovery region within which that controller can operate successfully. The recovery region is a subset of the states that are admissible with respect to the operational constraints. The largest recovery region can be found using linear matrix inequalities (LMIs). This approach is applicable to any linearizable system, which covers most practical control systems.

For the plant dx/dt = A x, with stability envelope A^T Q + Q A < 0, solve

  min log det Q, subject to the operational constraints c_i^T x < 1,

so that the ellipsoid x^T Q x < 1 is the largest recovery region inside the stability envelope. The system under the new complex controller must stay within the recovery region.

Safety switching rule: x^T Q x < 1.
Simplex Architecture for Control

[Data-flow block diagram: the plant is driven either by a trusted, simple, reliable controller or by an online-upgradeable complex controller; a stability-monitoring component applies the switching rule x^T Q x < 1 to decide which controller's output reaches the plant.]
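The switching logic of the Simplex architecture can be sketched as follows (a minimal Python illustration; the controller functions, the matrix Q, and the margin value are hypothetical). The monitor evaluates the Lyapunov function x^T Q x and hands control to the trusted simple controller whenever the state approaches the boundary of the recovery region.

```python
def lyapunov_value(x, Q):
    """Compute x^T Q x for state vector x and matrix Q (lists of lists)."""
    n = len(x)
    return sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))

def simplex_control(x, Q, complex_ctrl, simple_ctrl, margin=0.9):
    """Use the complex controller only while the state is safely inside
    the recovery region x^T Q x < 1; otherwise fall back to the trusted
    simple controller. The margin (< 1) leaves room for one control
    period of drift before the boundary is actually reached."""
    if lyapunov_value(x, Q) < margin:
        return complex_ctrl(x)
    return simple_ctrl(x)
```

Because the simple controller can recover from any state inside the recovery region, a buggy complex controller can degrade performance but cannot destabilize the plant, which is what makes online upgrades and safe testing of new controllers possible.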
The Inescapable Conclusion

The complexity of software has long passed the point where we can produce 100% defect-free software; denying this is naïve at best. However, our society increasingly relies on software whose complexity is ever increasing, and it is unacceptable to let a minor error cascade and bring down a major system.

The inescapable conclusion is that we must develop the scientific foundation for engineering stable software systems: systems that are not completely error free but can reliably deliver essential services in spite of residual errors.

All features are not equal: some are safety critical, some mission critical, some useful, and some of questionable value. The key is to have a reliable core and well-formed dependencies: a critical component may USE, but not DEPEND on, less critical services.
Reasons to be Optimistic

The United States of America is a highly stable and evolvable system. It has grown and made truly remarkable progress by the metric of civilization, even though many problems remain. Yet its basic components, human beings, are complex, error prone, and hard to test or verify. There are thousands of residual bugs in the telecom network, and it remains highly reliable. There are perhaps millions of bugs in the World Wide Web system of systems, but it is remarkably stable. Complex but stable systems are uncommon, but they can be and have been built.
Appendix
Sources of difficulties

- Unexpected interactions, resulting from incompatible abstractions; incorrect or implicit assumptions in system interfaces; and incompatible real-time, fault-tolerance, and security protocols.
- Inadequate development infrastructure, as reflected in the lack of domain-specific reference architectures, tools, and design patterns with known and parameterized real-time, robustness, and security properties.
- System instabilities, which result when faults and failures in one component cascade along complex and unexpected dependency graphs, producing catastrophic failures in a large part of, or even an entire, system.
Not Isolated Incidents

These are not isolated incidents. Rather, such accidents and developmental problems are the manifestation of building modern avionics systems whose complexity exceeds what the existing technological infrastructure can handle [1]. The Standish Group reported that a staggering 31.1% of projects will be canceled before they ever get completed. Further results indicate that 52.7% of projects will cost 189% of their original estimates. The cost of these failures and overruns is just the tip of the proverbial iceberg [2].

[1] http://www.gao.gov/new.items/d04393.pdf
[2] http://www.cs.nmt.edu/~cs328/reading/standish.pdf
Stable Systems

In most applications, all features are not equal: some are critical, some are important, some are useful, and some are superfluous. Given existing technology, industry can only afford to make the critical features highly reliable.

Complex and unknown dependency relations are a key contributor to software system instability: a seemingly minor fault in a non-critical service can cascade along dependency chains and bring down the whole system. A stable software system is one that guarantees critical system properties while allowing the safe exploitation of imperfect but useful components.