High Reliability Systems. Lloyd Moore, President


High Reliability Systems
Lloyd Moore, President
Lloyd@CyberData-Robotics.com
www.cyberdata-robotics.com

Overview
- Appropriate Use of This Presentation
- Causes of Failures
- Watchdogs
- Memory Techniques
- Safer Coding Practices
- Safe Shutdown Practices
- Summary

Appropriate Use of This Presentation

This presentation is intended to fill the middle region between normal software development practices and formal high reliability specifications such as MISRA, DO-178B, PCI-DSS, IEC 62304 and many others.

IF THE PROJECT YOU ARE WORKING ON IS SUBJECT TO FORMAL RELIABILITY GUIDELINES AND/OR SPECIFICATIONS, THIS PRESENTATION IS NOT FOR YOU. FOLLOW THE APPROPRIATE GUIDELINES TO THE LETTER!

OK, if you are still reading, then what is this presentation about?
- The above guidelines do not apply to every case and are too heavy for many projects.
- This presentation covers techniques that can be used as needed to make a project more reliable, but without going so far as to increase the development cost of the project.

Causes of Failures
- Software bugs!!!
  - The program does exactly what you said to do, just not what you intended it to do!
- Electrical and environmental noise
  - Both noisy power lines and stray induced magnetic fields can alter system state
  - ESD (static electricity) can alter memory and register contents randomly
- Operator abuses
  - They did WHAT?!?!?!?!?!?!?!?!?
- Resource limitations
  - Memory fragmentation, unexpectedly large inputs, unexpectedly long times
- Failures in other parts of the system
  - Mechanical changes and/or failures
  - Component failures which cascade, but do not fully disable the system
  - Networking and communications issues

The Basics
- Set the compiler to the most sensitive warning level, and ensure the code builds with ZERO warnings
  - Use of pragmas to clear warnings is VERY debatable; unused variables are likely the only valid case
- Use safe libraries
  - No naked pointers
  - Use APIs with length checking to avoid buffer overruns
  - Some groups now also have a "no naked loops" rule; this may not be achievable in all cases
- Have an established development and test process
  - Use revision control, defect tracking, and whatever else you believe is a best practice
  - Do code reviews on ALL code: this builds team understanding and finds bugs early
  - Unit test coverage should be 100% of critical code and as much of non-critical code as management will afford you
  - When there is a failure, do a root cause analysis and update deficient processes
  - The key is to have SOMETHING in place to which improvements can be made over time
- This slide is NOT complete, as there are MANY talks and debates covering these topics!

Watchdogs
- General definition: maintain surveillance over (person, activity, situation)
- In our case this refers to an independent piece of hardware which monitors the desired process and either shuts down or resets that process if some condition is not met.
- The most common form is the watchdog timer, a dedicated piece of hardware which will reset the main processor if it does not see a specific activity happen within a specified time interval.
- The activity is generally toggling an I/O line or writing one or more values to specific registers.
- Two general forms of this: on-chip and off-chip. There are advantages and disadvantages to both, and some feel quite strongly about which is better!
- Desktop PC motherboards can also be purchased with watchdog hardware!

Watchdog Behavior
- Single-stage watchdog:
  - Must toggle a line or write a specific value to a location every X ms to reset the watchdog timer
  - If the watchdog reset event does not happen, the watchdog resets the system / main processor
- Windowed watchdog:
  - Must reset every X ms, but not more often than every Y ms
  - Protects against more cases than a single-stage watchdog
- Multi-stage watchdog:
  - Must toggle a line high then low, toggle multiple lines, or write multiple specific values to specific locations at some predefined time interval(s)
  - Specifics vary from device to device
  - The key is that you can ensure multiple locations in your code are executing in the desired order
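The windowed case can be modeled in a few lines of host-side C. This is only a sketch of the timing rule, not any particular device: the type and function names are hypothetical, and real hardware would reset the processor where this model returns an error code.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a windowed watchdog (hypothetical names).
   max_ms is the "X" interval and min_ms the "Y" interval. */
typedef struct {
    uint32_t min_ms;        /* kicks closer together than this are too early */
    uint32_t max_ms;        /* kicks further apart than this are too late    */
    uint32_t last_kick_ms;  /* time of the last accepted kick                */
} windowed_wd_t;

typedef enum { WD_OK, WD_TOO_EARLY, WD_TOO_LATE } wd_result_t;

/* Evaluate a kick at time now_ms. Real hardware would reset the system
   on either error; the model just reports which rule was broken. */
wd_result_t wd_kick(windowed_wd_t *wd, uint32_t now_ms)
{
    uint32_t dt = (uint32_t)(now_ms - wd->last_kick_ms);

    if (dt < wd->min_ms) {
        return WD_TOO_EARLY;   /* e.g. a runaway loop hammering the kick */
    }
    if (dt > wd->max_ms) {
        return WD_TOO_LATE;    /* e.g. the main loop has stalled */
    }
    wd->last_kick_ms = now_ms;
    return WD_OK;
}
```

With a 10 ms to 100 ms window, a kick 50 ms after the last one is accepted, a second kick 5 ms later is rejected as too early, and a kick after a 200 ms gap is rejected as too late.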

Typical Watchdog Program Pattern

    int main(void)
    {
        initialize_system();
        while (1) {
            read_sensors();            /* gather inputs for this cycle    */
            reset_watchdog();          /* kick once per main-loop pass    */
            write_outputs();           /* act on the inputs               */
            sleep_until_next_cycle();  /* idle until the next period      */
        }
    }

- The general idea is that the watchdog gets reset once per main loop. You will want the watchdog timeout to be barely longer than the longest execution time of the main loop.
- If using a multi-stage watchdog, you can position one call just before read_sensors() and another call just before write_outputs(); now you can verify that the sensors were read before the outputs were written.
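Where the hardware only offers a single-stage watchdog, the ordering check can be approximated in software. The sketch below is one possible arrangement, not something the slide prescribes: checkpoint flags are set as each stage completes, and the hardware kick only happens if the stages ran in order. All names are hypothetical, and hw_watchdog_kick() is a stand-in that just counts calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the real hardware kick; counts calls so the
   behavior can be observed on a host build. */
static unsigned wd_kick_count = 0;
void hw_watchdog_kick(void) { wd_kick_count++; }

enum { CP_SENSORS_READ = 0x01u, CP_OUTPUTS_WRITTEN = 0x02u };
static uint8_t checkpoints = 0;

/* Call after read_sensors(): starts a fresh cycle. */
void checkpoint_sensors_read(void)
{
    checkpoints = CP_SENSORS_READ;
}

/* Call after write_outputs(): only valid if the sensors were read first. */
void checkpoint_outputs_written(void)
{
    if (checkpoints == CP_SENSORS_READ) {
        checkpoints |= CP_OUTPUTS_WRITTEN;
    } else {
        checkpoints = 0;   /* out of order: poison this cycle */
    }
}

/* Call once per main-loop pass. Kicks the hardware only when both
   checkpoints were hit in order; otherwise the watchdog is left to expire. */
bool watchdog_service(void)
{
    bool ok = (checkpoints == (CP_SENSORS_READ | CP_OUTPUTS_WRITTEN));
    checkpoints = 0;
    if (ok) {
        hw_watchdog_kick();
    }
    return ok;
}
```

A pass that skips read_sensors() never produces a kick, so the hardware timeout does the rest.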

Watchdog Gotchas!
- DO NOT put watchdog resets into interrupt handlers unless the only thing you care about is verifying that the interrupt is still running!
- On-chip watchdogs are typically disabled when in debugging mode; off-chip watchdogs are not.
  - The typical issue is that you connect the debugger, hit a breakpoint, and your system resets!
  - Recommended practice: have the reset signal trace on the board connected by default, but allow for a jumper location. On debug boards cut the trace and install the jumper; on production boards leave the trace alone and fit no jumper.
- As your code grows and changes, the length of time needed for the watchdog timeout will also change. Keep an eye on this, as watchdog resets will look like system crashes when you are developing!!!

Memory (I/O) Lockout Regions
- Specific regions in memory that are protected by some form of lockout. These are typically assisted by dedicated hardware but can also be emulated with an MMU.
- The goal is to prevent accidental writes to some type of critical control.
- Various forms of this:
  - Location can only be written X clock cycles after reset
  - Location can only be written once after reset
  - Location is protected by some other location which must have a key value currently written to it
    - Note that in this case you may NOT want to have the key access and the protected value access in a common routine!
    - Remember to always clear the key value when you are done updating!
- Very commonly used to protect the on-chip watchdog timer, both in terms of configuring the timer and writing to the reset location.
- On chips with FPGA-style resources you can also build your own protection to do specifically what is needed.
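The key-protected form can be emulated in software when no hardware lockout exists. The following is a minimal sketch under that assumption; the key value, type, and function names are all hypothetical. Unlocking and writing are deliberately separate routines, and the key is cleared on every write attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_KEY 0xA5C3u   /* hypothetical key value */

typedef struct {
    uint16_t key;     /* must hold LOCK_KEY for the next write to succeed */
    uint32_t value;   /* the protected control value */
} locked_reg_t;

/* Step 1: arm the lock. Kept separate from the write routine so that a
   stray jump into the write path alone cannot change the value. */
void locked_reg_unlock(locked_reg_t *r)
{
    r->key = LOCK_KEY;
}

/* Step 2: attempt the write. The key is always cleared, whether or not
   the write was allowed, so each unlock permits at most one update. */
bool locked_reg_write(locked_reg_t *r, uint32_t v)
{
    bool allowed = (r->key == LOCK_KEY);
    r->key = 0;
    if (allowed) {
        r->value = v;
    }
    return allowed;
}
```

A write without a preceding unlock is refused and leaves the value untouched; a second write after a single unlock is refused as well.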

Memory Allocation Patterns
- In long-running systems memory fragmentation becomes a big issue.
- Many embedded systems run for years without a reboot and don't have any virtual memory system to hide fragmentation.
- In these cases dynamic memory allocation becomes a source of instability!
- Potential solutions:
  - Use only static allocations: memory usage is known at compile time
    - For embedded microcontrollers this is actually a very desirable solution
  - Use only automatic allocations: everything will end up on the stack
    - Watch your stack space here; this trades fragmentation for stack overflow
  - Use dynamic memory allocation, but only once at system startup
  - If using C++ you may want to disable new and delete to prevent hidden allocations
    - This will also preclude using portions of the standard library!
  - Use dedicated heap(s) and re-initialize them every so often
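One way to combine "static allocation" with a "dedicated heap" is a fixed-block pool carved out of a static array. The sketch below assumes that approach; the block count, block size, and function names are hypothetical. Because every block is the same size the pool cannot fragment, and pool_init() doubles as the periodic re-initialization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCKS     4u    /* hypothetical sizing */
#define POOL_BLOCK_SIZE 32u

/* A free block stores the free-list link in its own storage. */
typedef union pool_block {
    union pool_block *next;
    uint8_t bytes[POOL_BLOCK_SIZE];
} pool_block_t;

static pool_block_t pool_storage[POOL_BLOCKS];  /* size known at compile time */
static pool_block_t *pool_free_list = NULL;

/* Build (or rebuild) the free list. Any blocks still handed out become
   invalid, so only call this at startup or at a known quiet point. */
void pool_init(void)
{
    pool_free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        pool_storage[i].next = pool_free_list;
        pool_free_list = &pool_storage[i];
    }
}

/* Returns a block of POOL_BLOCK_SIZE bytes, or NULL when exhausted. */
void *pool_alloc(void)
{
    pool_block_t *b = pool_free_list;
    if (b != NULL) {
        pool_free_list = b->next;
    }
    return b;
}

/* Return a block obtained from pool_alloc() to the pool. */
void pool_free(void *p)
{
    if (p != NULL) {
        pool_block_t *b = p;
        b->next = pool_free_list;
        pool_free_list = b;
    }
}
```

Exhaustion shows up as a NULL return at a predictable point rather than as slow fragmentation over months of uptime.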

Data Integrity Issues
- Data values themselves can change OUTSIDE OF program control!
- Most of us are familiar with overwrite-type problems; however, in some systems this isn't the only issue.
- Memory can also be affected by:
  - Bad memory locations
  - Electrical noise
  - Electrostatic discharge (a special form of electrical noise)
  - Environmental radiation
- Note that this may not happen very often, but it does happen!
  - Electrostatic discharge is a particularly common event in many areas, especially in low-humidity conditions.
- Some systems address this issue with ECC memory.

Data Validation Methods
- The general principle is to keep critical data in a common data structure.
- Now you can operate on the data as a set, and this gives you some advantages:
  - Data can be checked easily
  - Sentinel values can be placed into the data structure and tested at regular intervals; these are constant values throughout the life of the program
  - The whole data structure can be checksum / CRC validated at key points
  - The data structure can be mirrored and again validated at key points
- These techniques work best when the program duty cycle is low and checking is done during the idle times.
- You can also incorporate this into the watchdog reset routine, such that the watchdog is only reset if the data validation tests pass.
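The three checks (sentinels, CRC, mirror) can be sketched on one small structure. Everything here is illustrative: the field names, sentinel constants, and function names are hypothetical, and the CRC is a plain bitwise CRC-32 chosen for brevity rather than speed. The struct uses only 32-bit members so that it contains no padding bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SENTINEL_HEAD 0xDEADBEEFu   /* hypothetical constants */
#define SENTINEL_TAIL 0x600DF00Du

/* All critical values live in one struct so they can be checked as a set. */
typedef struct {
    uint32_t head;       /* sentinel, constant for the life of the program */
    int32_t  setpoint;
    uint32_t mode;
    uint32_t tail;       /* sentinel */
    uint32_t crc;        /* covers every field before it */
} critical_data_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_bytes(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++) {
            crc = (crc & 1u) ? ((crc >> 1) ^ 0xEDB88320u) : (crc >> 1);
        }
    }
    return ~crc;
}

/* Call after every legitimate update of the data. */
void critical_seal(critical_data_t *d)
{
    d->head = SENTINEL_HEAD;
    d->tail = SENTINEL_TAIL;
    d->crc = crc32_bytes((const uint8_t *)d, offsetof(critical_data_t, crc));
}

/* Sentinel and CRC check on one copy. */
bool critical_valid(const critical_data_t *d)
{
    return d->head == SENTINEL_HEAD &&
           d->tail == SENTINEL_TAIL &&
           d->crc == crc32_bytes((const uint8_t *)d,
                                 offsetof(critical_data_t, crc));
}

/* Mirror check: the primary must be valid and identical to its mirror. */
bool critical_pair_valid(const critical_data_t *primary,
                         const critical_data_t *mirror)
{
    return critical_valid(primary) &&
           memcmp(primary, mirror, sizeof *primary) == 0;
}
```

A call such as critical_pair_valid(&primary, &mirror) could gate the watchdog reset, so that a corrupted structure leads to a reset instead of continued operation on bad data.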

Memory Munging
- Embedded microcontrollers will typically have extra memory which is not used by the application.
- System reliability can be improved by properly filling that memory with specific data:
  - Unused RAM can be filled with a given pattern, and that pattern verified as described on the data validation slide
    - Particularly useful to do this just beyond the maximum expected stack
  - Unused flash/ROM can be filled to trigger a reset or halt if the program ever jumps out of the defined program region
    - This is a processor-dependent technique
    - Fill memory with NOP instructions: this will cause most processors to loop around to the start of memory, just like a reset (beware of memory lockouts!)
    - Fill memory with reset instructions: some processors have this, others don't
    - Fill memory with jumps to a common safe halting or reset routine
- Note: you generally DO NOT want to fill memory with HALT instructions!
  - If you just halt, you don't know that the system is in a safe state
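The RAM half of this can be sketched as a fill routine plus a check that counts how much of the pattern survives. The example assumes a descending stack, so the guard region sits below the stack and is consumed from its high end; the pattern value and function names are hypothetical, and on a real target the region would come from the linker map rather than a local array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FILL_PATTERN 0xA5A5A5A5u   /* hypothetical fill value */

/* Paint a region of unused RAM with the fill pattern (run once at startup). */
void region_fill(uint32_t *region, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        region[i] = FILL_PATTERN;
    }
}

/* Count the words, starting from the low address, that still hold the
   pattern. With a descending stack growing into the region from above,
   this is the remaining headroom; a result below `words` means something
   wrote into memory that should have stayed unused. */
size_t region_untouched_words(const uint32_t *region, size_t words)
{
    size_t n = 0;
    while (n < words && region[n] == FILL_PATTERN) {
        n++;
    }
    return n;
}
```

Checking the count during idle time gives an early warning well before the stack actually collides with live data.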

State Tracking
- This is a technique where each major step of the program verifies that it was called from the correct location.
- The idea is to abort if any segment of code is called from an unexpected path.
- By necessity this will make your program very rigid!
- Typically involves some type of check at the beginning of each critical routine.
- For a state machine this could simply be checking the prior state as a precondition to executing the current state.
- In the most general case this would be checking the call stack to make sure the caller is one of an expected set.
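The state-machine version of this check might look like the sketch below. The states and the allowed transitions are invented for illustration; the point is only that each entry checks the prior state and that any unexpected path lands in a fault state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states for illustration. */
typedef enum { ST_IDLE, ST_ARMED, ST_FIRING, ST_FAULT } state_t;

static state_t current_state = ST_IDLE;

state_t state_current(void)
{
    return current_state;
}

/* Enter `next` only if the prior state is an expected predecessor.
   Any unexpected transition drops the machine into ST_FAULT. */
bool state_enter(state_t next)
{
    bool ok = false;

    switch (next) {
    case ST_IDLE:   ok = (current_state == ST_ARMED ||
                          current_state == ST_FIRING); break;
    case ST_ARMED:  ok = (current_state == ST_IDLE);   break;
    case ST_FIRING: ok = (current_state == ST_ARMED);  break;
    case ST_FAULT:  ok = true;                         break;
    }

    current_state = ok ? next : ST_FAULT;
    return ok;
}
```

Jumping straight from idle to firing, for example, is refused and leaves the machine in the fault state, which is exactly the rigidity the slide warns about.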

Validation Boundaries
- The general idea here is to have specific boundaries in your program where you fully check the parameters being passed and/or the overall system state.
- Very similar in concept to what threat modeling calls trust boundaries.
  1. Divide your application into specific layers and modules (you should be doing this anyway!)
  2. Any time flow crosses from one layer or module to another, any data being passed gets sanity checked
  3. The extreme version of this is checking at the entry to EVERY function call; this may not be feasible due to knowledge or time limitations
- Has the benefit of pushing sanity checks to the various module APIs of the application, where they are most easily accomplished and most easily verified by code review to exist!
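As a sketch of a check at a module boundary, consider a hypothetical motor module: the public entry point validates everything, and the internal helper trusts what it is given. The limits, types, and names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOTOR_CHANNELS 4u      /* hypothetical limits */
#define MOTOR_MAX_RPM  3000

typedef enum { MOTOR_OK = 0, MOTOR_ERR_NULL, MOTOR_ERR_RANGE } motor_status_t;

typedef struct {
    int32_t speed_rpm;
    uint8_t channel;
} motor_cmd_t;

static int32_t applied_rpm[MOTOR_CHANNELS];

/* Inside the boundary: trusts its input, because every caller must
   come through motor_set_speed(). */
static void motor_apply(const motor_cmd_t *cmd)
{
    applied_rpm[cmd->channel] = cmd->speed_rpm;
}

/* The module API is the validation boundary: all sanity checks live
   here, where a code review can confirm they exist. */
motor_status_t motor_set_speed(const motor_cmd_t *cmd)
{
    if (cmd == NULL) {
        return MOTOR_ERR_NULL;
    }
    if (cmd->channel >= MOTOR_CHANNELS) {
        return MOTOR_ERR_RANGE;
    }
    if (cmd->speed_rpm < -MOTOR_MAX_RPM || cmd->speed_rpm > MOTOR_MAX_RPM) {
        return MOTOR_ERR_RANGE;
    }
    motor_apply(cmd);
    return MOTOR_OK;
}
```

Keeping the checks in one public function means the internal helpers stay simple, and a reviewer has a single place to look.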

Time Calculations
- Time calculations are a frequent source of one-time errors, specifically:
  - Leap year events
  - End of year events
  - The 49-day roll-over event (and similar), which occurs when a 32-bit int is used as a millisecond timer
- Recommendations:
  - Always use full date / time values for calculations
  - Use standard libraries for time manipulation; do not invent your own!
  - Scale simple counters to have a lifetime exceeding the maximum possible lifetime of your program execution
    - Battery-powered devices: at least 2x the expected battery life, assuming that removing the batteries forces a reset
    - Other devices: just use a 64-bit int! This typically gives millions of years, which is good enough!
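The roll-over arithmetic is easy to check: a 32-bit millisecond counter holds 2^32 ms, and 2^32 / 86,400,000 ms per day is about 49.7 days. The small sketch below computes the wrap time for any counter width and applies the 2x margin rule; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MS_PER_DAY 86400000ull

/* Whole days until an unsigned millisecond counter of `bits` width wraps. */
uint64_t wrap_days(unsigned bits)
{
    uint64_t max = (bits >= 64u) ? UINT64_MAX : ((1ull << bits) - 1ull);
    return max / MS_PER_DAY;
}

/* Does a counter of this width outlast the required lifetime with the
   2x margin suggested for battery-powered devices? */
bool counter_outlasts(unsigned bits, uint64_t lifetime_days)
{
    return wrap_days(bits) >= 2ull * lifetime_days;
}
```

For a device expected to run ten years (about 3650 days), a 32-bit millisecond counter fails this test while a 64-bit one passes with an enormous margin.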

Techniques to Avoid
- Recursion
  - Great for solving some types of problems, but in general can lead to stack overflows which depend on data values
- Threading
  - Again great for certain types of problems, but getting threading correct is HARD!!
  - In many embedded applications threading can be simulated by the use of interrupts
    - Hardware guarantee of priority
    - On some processors only one can be in flight at any time

To Halt or To Reset?????
- This is a VERY application-dependent question and can go either way!
- Error conditions will generally appear to happen at random times; therefore you should not assume anything about the state of your system when an error condition occurs.
  - Motors could be on, moving machinery
  - Heating / cooling elements could be on
  - A communication transaction could be in progress
- The general recommendation here is to have ONE routine which places the system into a safe condition for your application.
  - This routine is called at startup and also called any time an error condition is detected.
  - Note that this also means the safe condition routine gets tested regularly!
- The question of halting or resetting now becomes one of desired behavior: do you want the process to continue without human intervention?
- You can also use scheduled resets to improve system reliability.
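A minimal sketch of the single safe-condition routine follows. The outputs are modeled as a struct of flags so the idea can be exercised on a host; on a real system these would be GPIO or peripheral writes. All names are hypothetical, and what happens after the safe state is reached (halt or reset) is left to the application.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the hardware outputs. */
typedef struct {
    bool motor_on;
    bool heater_on;
    bool comms_active;
} outputs_t;

static outputs_t outputs;
static unsigned safe_state_calls = 0;

/* The ONE routine that puts the system into a safe condition. It makes
   no assumptions about the state it is called in. */
void enter_safe_state(void)
{
    outputs.motor_on = false;
    outputs.heater_on = false;
    outputs.comms_active = false;
    safe_state_calls++;
}

/* Called at startup, so the safe-state path is exercised on every boot. */
void system_startup(void)
{
    enter_safe_state();
}

/* Called whenever an error is detected; the caller then halts or resets. */
void on_error(void)
{
    enter_safe_state();
}
```

Because startup and error handling share the routine, a bug in it shows up on the bench at first power-up rather than during a field failure.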

Use an At Exit Routine
- Register an exit function with the language / framework you are using and redirect it to your safe condition routine.
- This function is called automatically any time the program terminates normally.
- Always terminate the program through a controlled mechanism (do not use abort() or similar methods).
- Catch all top-level exceptions so you always exit cleanly; some don't like this, but it ensures your safe condition routine is run at exit.

    For C:      int atexit(void (*func)(void));
    For C++:    extern int atexit(void (*func)(void));
    For C++11:  extern int atexit(void (*func)(void)) noexcept;

- Note: not typically available in straight microcontroller environments.
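In C the registration is a single call to the standard atexit() function. The sketch below wraps it so the return value is checked; the handler body and the wrapper name are hypothetical placeholders for the application's real safe condition routine.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int safe_exit_registered = 0;

/* Placeholder for the application's safe condition routine. */
static void enter_safe_state(void)
{
    puts("safe state: outputs off");
}

/* Register the safe-state routine to run at normal termination, that is,
   on exit() or a return from main(). It does not run after abort().
   Returns 0 on success, -1 if registration failed. */
int register_safe_exit(void)
{
    if (atexit(enter_safe_state) != 0) {
        return -1;
    }
    safe_exit_registered = 1;
    return 0;
}
```

Calling register_safe_exit() early in main() and treating a non-zero result as a startup failure keeps the guarantee explicit.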

Summary
- In real-world applications, errors can come from sources other than programming bugs.
- Make sure you are following all the basics of good coding practice.
- Watchdog timers are the most common form of error detection used on systems; however, to get maximum benefit the watchdog needs to be used correctly.
- Most non-bug-related failures come as a result of environmental influences corrupting memory, and there are several techniques available to detect this condition without having to resort to ECC memory.
- Knowing the expected flow of your program opens up further opportunities for validation.

Questions?