Introduction to Robust Systems

Similar documents
Robust System Design with MPSoCs Unique Opportunities

RELIABILITY and RELIABLE DESIGN. Giovanni De Micheli Centre Systèmes Intégrés

Error Resilience in Digital Integrated Circuits

416 Distributed Systems. Errors and Failures Oct 16, 2018

6.033 Lecture Fault Tolerant Computing 3/31/2014

416 Distributed Systems. Errors and Failures Feb 1, 2016

Why Things Break -- With Examples From Autonomous Vehicles ,QVWLWXWH IRU &RPSOH[ (QJLQHHUHG 6\VWHPV

FAULT TOLERANT SYSTEMS

TSW Reliability and Fault Tolerance

Today. ESE532: System-on-a-Chip Architecture. Message. Intel Xeon Phi Offerings. Preclass 1 and Intel Xeon Phi Offerings. Intel Xeon Phi Offerings

Failures associated with HCI

EE382C Lecture 14. Reliability and Error Control 5/17/11. EE 382C - S11 - Lecture 14 1

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability. Copyright 2010 Daniel J. Sorin Duke University

Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan

The 10 Disaster Planning Essentials For A Small Business Network

Overview ECE 753: FAULT-TOLERANT COMPUTING 1/21/2014. Recap. Fault Modeling. Fault Modeling (contd.) Fault Modeling (contd.)

VLSI Test Technology and Reliability (ET4076)

Self-Repair for Robust System Design. Yanjing Li Intel Labs Stanford University

PowerVault MD3 Storage Array Enterprise % Availability

TU Wien. Fault Isolation and Error Containment in the TT-SoC. H. Kopetz. TU Wien. July 2007

ZKLWHýSDSHU. 3UHð)DLOXUHý:DUUDQW\ý 0LQLPL]LQJý8QSODQQHGý'RZQWLPH. +3ý 1HW6HUYHUý 0DQDJHPHQW. Executive Summary. A Closer Look

Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan

Fault-Tolerance I: Atomicity, logging, and recovery. COS 518: Advanced Computer Systems Lecture 3 Kyle Jamieson

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

Qualification for Reliability and other programs at the

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Aerospace Software Engineering

Fault Tolerant Computing CS 530

Safety Architecture Patterns

Wireless Network Virtualization: Ensuring Carrier Grade Availability

Critical Systems. Objectives. Topics covered. Critical Systems. System dependability. Importance of dependability

Fault-Tolerant Storage and Implications for the Cloud Charles Snyder

Very Large Scale Integration (VLSI)

VLSI Test Technology and Reliability (ET4076)

Today: Fault Tolerance

Rediffmail Enterprise High Availability Architecture

MediaTek Overview AI/5G-enabled Systems Test Challenges Systems Orientation New Opportunities

Tolerating Hardware Device Failures in Software. Asim Kadav, Matthew J. Renzelmann, Michael M. Swift University of Wisconsin Madison

Introduction to Surge Protection

Dependability. IC Life Cycle

Fault Isolation for Device Drivers

The 10 Disaster Planning Essentials

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Fault-Injection testing and code coverage measurement using Virtual Prototypes on the context of the ISO standard

Issues in Programming Language Design for Embedded RT Systems

New Reliable Reconfigurable FPGA Architecture for Safety and Mission Critical Applications LOGO

What Comes Next? Reconfigurable Nanoelectronics and Defect Tolerance. Technology Shifts. Size Matters. Ops/sec/$

Diagnosis in the Time-Triggered Architecture

Introduction to Business continuity Planning

DATAPATH ARCHITECTURE FOR RELIABLE COMPUTING IN NANO-SCALE TECHNOLOGY

Today: Fault Tolerance. Fault Tolerance

IMPLICATIONS OF RELIABILITY ENHANCEMENT ACHIEVED BY FAULT AVOIDANCE ON DYNAMICALLY RECONFIGURABLE ARCHITECTURES

Riccardo Mariani, Intel Fellow, IOTG SEG, Chief Functional Safety Technologist

Page 1. Outline. A Good Reference and a Caveat. Testing. ECE 254 / CPS 225 Fault Tolerant and Testable Computing Systems. Testing and Design for Test

CMOS Testing: Part 1. Outline

After the Attack. Business Continuity. Planning and Testing Steps. Disaster Recovery. Business Impact Analysis (BIA) Succession Planning

Today: Fault Tolerance. Replica Management

Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware

Interplay Between Language and Formal Verification

Chapter 8. Coping with Physical Failures, Soft Errors, and Reliability Issues. System-on-Chip EE141 Test Architectures Ch. 8 Physical Failures - P.

VERIFYING SOFTWARE ROBUSTNESS. Ross Collard Collard & Company

Power Quality. The Overlooked Productivity Variable

Fault-Tolerant Embedded System

Fault-Tolerant Embedded System

High Availability Using Fault Tolerance in the SAN. Wendy Betts, IBM Mark Fleming, IBM

BEST PRACTICES FOR IMPROVING EXTERNAL DNS RESILIENCY AND PERFORMANCE

A SKY Computers White Paper

Automatic Generation of Availability Models in RAScad

Power Systems High Availability & Disaster Recovery

Fault tolerance and Reliability

BUSINESS CONTINUITY: THE PROFIT SCENARIO

Business Continuity and Disaster Recovery. Ed Crowley Ch 12

Distributed Systems COMP 212. Lecture 19 Othon Michail

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

Fault Tolerance. Distributed Systems IT332

Safety and Reliability of Software-Controlled Systems Part 14: Fault mitigation

SOLVING THE DRAM SCALING CHALLENGE: RETHINKING THE INTERFACE BETWEEN CIRCUITS, ARCHITECTURE, AND SYSTEMS

Fault Tolerance in Parallel Systems. Jamie Boeheim Sarah Kay May 18, 2006

POWER4 Systems: Design for Reliability. Douglas Bossen, Joel Tendler, Kevin Reick IBM Server Group, Austin, TX

Disaster Recovery Solutions With Virtual Infrastructure: Implementation andbest Practices

Causes of Software Failures

Nooks. Robert Grimm New York University

DATA DOMAIN INVULNERABILITY ARCHITECTURE: ENHANCING DATA INTEGRITY AND RECOVERABILITY

DEPENDABLE PROCESSOR DESIGN

March 4-7, 2018 Hilton Phoenix / Mesa Hotel Mesa, Arizona Archive

Root cause codes: Level One: See Chapter 6 for a discussion of using hierarchical cause codes.

Stochastic Processors (or processors that do not always compute correctly by design)

Provided as an educational service by: Introduction

Module 8 Fault Tolerance CS655! 8-1!

Service. Sentry Cyber Security Gain protection against sophisticated and persistent security threats through our layered cyber defense solution

High Availability Using Fault Tolerance in the SAN. Mark S Fleming, IBM

Dependability tree 1

DTC P0AA6 - High Voltage Isolation Fault

AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware

EOS: An Extensible Operating System

Today: Fault Tolerance. Failure Masking by Redundancy

SIListra. Coded Processing in Medical Devices. Dr. Martin Süßkraut (TU-Dresden / SIListra Systems)

Transcription:

Introduction to Robust Systems Subhasish Mitra Stanford University Email: subh@stanford.edu 1

Objective of this Talk Brainstorm What is a robust system? How can we build robust systems? Robust systems research directions How can EE392U be beneficial? Not covered: Specific solutions 2

What is a Robust Design? Quote from www.amsup.com/robust_desig/index.htm Not just strong Flexible! Idiot-proof! Simple! Efficient! Consistent high-level performance Wide range of changing conditions Client & manufacturing related Anticipated vs. unanticipated 3

Robust Computing System Defects, Process variation, Degraded transistors Radiation, Noise Inputs Design errors, Software failures Robust System Malicious attacks, Human errors Acceptable Outputs Performance Power Data Integrity Availability Security 4

Availability & Data Integrity Availability: Probability system operational at time t Telecom: 99.999% 5 mins./ year downtime Data integrity: no undetected errors $20K not interpreted as $3,616 Beagle2 mission presumed to be lost (www.beagle2.com) 5

Safety Active safety for drive-by-wire systems Implantable medical devices Nano-robot assisted remote surgery Context-aware pro-active healthcare system 6

Drive-by-wire a Reality What about reliability? 7

Security Major adversaries Security thefts Virus, hacks, spam, Terrorists 8

Power & Performance 9

Server Reliability Goals MTBF = Mean Time Between Failures Taken from Bossen, IRPS 2002 10

Causes of System Failures? Depends on who you talk to Application domains PCs vs. servers Medical devices, automotive, System configurations (& costs) Hardware costs & lifetime Single vs. clusters Application-specific vs. general-purpose 11

Windows XP Failures [Murphy, ACM Queue, Nov. 2004] 5% Microsoft software bugs, 12% hardware, 83% 3 rd party What you call a bug? Hardware failures 3rd party driver crashes Processor 17% General 26% BIOS 1% Disk 26% Memory 30% Firewall 4% Modem 9% Others 20% Audio 9% CD-Burning 10% Display 35% Anti-Virus 13% 12

Windows XP Failures: Observations 3 rd party drivers crashes bug definition? Increasing hardware failures aging hardware? Good processor reliability enablers Short PC lifetime Speed & voltage guardbands during design Price: power & performance cost Classical scaling was sufficient Inexpensive test & reliability screens BUT, progressively harder in sub-65nm 13

Causes of Server Unavailability Data from the Past Total Outage Cause Scheduled Unscheduled Unscheduled Outage Cause Other Software Hardware Ack: Lisa Spainhower, IBM For 24x7 must address both scheduled & unscheduled Often operator error predominates as source of downtime 14

Server Failures: Observations Most software bugs soft Heisenbugs Gone after reboot / restart Repair time is key here Operator errors major issue going forward High hardware reliability Enabled by hardware redundancy, BUT Redundancy expensive Hardware failure rates increasing Performance & power scaling slowdown 15

Why Worry About Hardware Reliability? Major process variation Worst-case design impractical Perfect design verification + test not enough Manufacturing process imperfect Testing imperfect: Warranty failures Transient errors during system operation e.g., noise, radiation induced soft errors Aging : e.g., slow transistors with time 16

Process variation: Power & Performance Impact Delay typ max min Ack: Prof. Giovanni De Micheli Voltage 17

Bathtub Curve Marginal parts due to defects, e.g., gate-tosource shorts, small opens, poor vias & contacts; Mitigated by Burn-in Transistor degradation (e.g., PMOS threshold voltage shift), electromigration, oxide breakdown; Mitigated by conservative design (overdesign) to avoid failures during intended product lifetime Failure Rate Infant Mortality Normal lifetime e.g., soft errors in memories mitigated by Error Correcting Codes Wearout Period 1-20 weeks ~ 3-15 years Time 18

(Scary?) Bathtub: Future Technologies Advanced technologies: burnin out of steam? Advanced technologies: increasing wearout failures Failure Rate Advanced technologies: increasing transient errors Infant mortality Normal lifetime Wearout Time Exciting opportunities for new system design techniques to cope with failures 19

Related EE392U Seminars Larry Votta, Distinguished Engineer, SUN Oct. 3 Why Do Systems Fail? Oct. 31 Lisa Spainhower, Distinguished Engineer, IBM Estimating the Risk of Releasing Software, Nov. 7 Brendan Murphy, Microsoft Research Reliable Design from Unreliable Components Nov. 14 Shekhar Borkar, Fellow, Intel Columbia Disaster Talk Nov. 28 Prof. Greg Kovacs, Stanford 20

How to Build Robust Systems? Avoidance Conservative design Design validation Thorough hardware & software test Infant mortality screen for hardware Transient error avoidance Proper interfaces to minimize operator errors Correct by Construction Simply Not sufficient Several challenges in future 21

How to Build Robust Systems? Tolerance Error detection during system operation Permanent & correlated hardware failures? Bohrbugs vs. Heisenbugs? On-line monitoring & diagnostics Self-recovery & Self-repair Automated self-managing systems Major Challenge: PROVE these WORK! Classical fault-tolerance very expensive Classical fault-tolerance inadequate 22

Fault-Tolerant Computing 23

High Availability Building Blocks Fault Tolerance Fault Avoidance Spare/ Degrade Concurrent Repair Design Verification System Integration SW Recover Failure Masking Reliability Integration Detect & Isolate Data Integrity System Design Technology 24

Related EE392U Seminars Larry Votta, Distinguished Engineer, SUN Oct. 3 System effects & error protection Oct. 10 Prof. Ravi Iyer, University of Illinois at Urbana Champaign Fault Tolerance in Space Environments Oct. 17 Dr. Philip Shirvani, nvidia Trusted systems: Prof. Hector Garcia Molina Oct. 24 Why Do Systems Fail? Oct. 31 Lisa Spainhower, Distinguished Engineer, IBM Estimating the Risk of Releasing Software, Nov. 7 Brendan Murphy, Microsoft Research 25

Robust Systems as Research Area CRA Recommendations Trouble-free systems PCs zero administration Large-scale systems Millions of users Administered by single person Self-diagnosing, self-healing, self-evolving Dependable and survivable systems Secure, safe, reliable, available 26

Importance in Revolutionary Nanotechnology Revolutionary nanotechnologies e.g., Molecular electronics Well-acknowledged fact Regular structures Defect prone Errors during normal operation ~ 5-10% faulty Must be self-healing! 27

Broad Research Directions: Looking for Interested Students Understand failures for various applications PCs, large servers, embedded (e.g., cars, digital home), space systems, healthcare Expertise required Experimental data collection Simulation & modeling Circuits, architecture, systems, HCI 28

Broad Research Directions: Looking for Interested Students New robust system design techniques Failure avoidance & resilience support Expertise required Circuit / logic design Architecture Compiler & operating systems Human-Computer interaction 29

Broad Research Directions: Looking for Interested Students Robust system prototypes PROVE that the system works! How? Not just simulation Build real system prototypes How? Nanotechnology architectures Built-in defect & fault-tolerance? Conventional methods work for very low failure rates 30