EE382C Lecture 14. Reliability and Error Control 5/17/11. EE 382C - S11 - Lecture 14 1



2 Announcements
- Don't forget to iterate with us for your checkpoint 1 report
- Send time slot preferences for checkpoint 2
- Project presentations next week
  - Let us know if you are OK with presenting on Tuesday, May 24th

3 Question of the day
- Consider a symmetric multiprocessing (SMP) network that does not allow packet loss and needs an availability of [value missing]
- Link BER is [value missing]
- Router components have a failure rate of 1000 FITs
- How best can you achieve this reliability requirement?

4 Reliability and Availability
- Reliability: R(t)
  - Probability that the system is working at time t, given that it was working at time t=0 and has had no failures in between
- Availability: A(t)
  - Probability that the system is working when needed, at a given point in time t
  - Often affected by the repair process
  - A ~ MTBF/(MTBF+MTTR)
- MTBF: mean time between failures
- FIT: failures in time (failures per 10^9 hours of operation). Inverse of MTBF with zero repair time
- MTTR: mean time to recovery
- RAS requirements: reliability, availability, and serviceability
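The definitions above can be turned into a small calculation. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, the 24-hour MTTR is an assumed value chosen only for illustration, and the 1000 FITs figure is the router failure rate from the question of the day. It relies on the standard convention that one FIT is one failure per 10^9 hours.

```python
HOURS_PER_BILLION = 1e9  # one FIT = one failure per 10^9 hours of operation


def mtbf_hours_from_fit(fit):
    """Convert a failure rate in FITs to an MTBF in hours."""
    return HOURS_PER_BILLION / fit


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability, A ~ MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_fit(component_fits):
    """Failure rate of a system that needs every component working.

    Assumes independent failures, so the rates simply add.
    """
    return sum(component_fits)


# A router component rated at 1000 FITs has an MTBF of one million hours.
router_mtbf = mtbf_hours_from_fit(1000)

# Assumed repair time of 24 hours (hypothetical, not from the slides).
router_availability = availability(router_mtbf, mttr_hours=24.0)

# Four such components in series fail four times as often.
system_mtbf = mtbf_hours_from_fit(series_fit([1000] * 4))
```

The sketch makes the lever visible: for a fixed failure rate, availability improves only by shrinking MTTR, which is why serviceability appears alongside reliability and availability in RAS requirements.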

5 Examples of RAS Requirements
- Enterprise Server
  - A = [value missing]
  - System-level requirement
  - Can reflect to a network-level requirement, or detect and recover from network failures
  - In general, every packet must be correctly received or the system will fail
- Internet Router
  - A = [value missing]
  - But OK to drop packets (at a rate of [value missing])
  - Turn failures into packet drops

6 RAS Requirements in Those Systems
- Dropping (reliability)
  - Allowed or not
  - Rate allowed (e.g., [value missing])
- Availability (A)
  - [value missing] to [value missing]
- Serviceability (MTTR)

7 MTTF and MTTR
A = MTTF / (MTTF + MTTR)

8 Failure Modes and Fault Models
Failure Mode                       Model       Rate              Units
Gaussian noise on channel          Transient   [value missing]   BER
Alpha particle strike on memory    Soft        10^-9             SER
Alpha particle strike on logic     Transient   [value missing]   BER
Electromigration                   Stuck-at    1                 MTBF
Connector corrosion                Stuck-at    10                MTBF
Operator removes module            Fail-stop   10^5              MTBF
Software failure                   Fail-stop   10^4              MTBF

9 An Analogy

10 The Bathtub Curve
[Figure: failure rate (FITs) versus time (hours), with an infant-mortality region at the start and a wearout region at the end]

11 Detection, Containment, and Recovery
Three-step program for dealing with errors:
1. Detection: discover the error
   - CRC codes on channels
   - Parity or ECC codes on memories
   - Self-checking logic
2. Containment: prevent the error from propagating further
   - Mask it
   - Drop the packet (and retry)
   - Fail stop
3. Recovery: resume normal service
   - Return to a known state
   - Resume sending traffic
   - Possibly resend the faulted packet
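The three steps can be sketched in a few lines. This is an illustrative example under stated assumptions, not the mechanism used by any particular router: it uses the CRC-32 from Python's `zlib` module as the channel code, and the helper names (`frame`, `check`, `flip_bit`, `send_with_retry`, `corrupt_first_attempt`) and the retry limit are hypothetical.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes):
    """Detection: return the payload if the CRC matches, otherwise None."""
    payload, received_crc = framed[:-4], int.from_bytes(framed[-4:], "big")
    return payload if zlib.crc32(payload) == received_crc else None


def flip_bit(data: bytes, bit: int) -> bytes:
    """Model a transient single-bit error on the channel."""
    corrupted = bytearray(data)
    corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


def send_with_retry(payload: bytes, channel, max_tries: int = 3):
    """Containment and recovery: drop bad frames, resend, else fail stop."""
    for attempt in range(1, max_tries + 1):
        delivered = check(channel(frame(payload), attempt))
        if delivered is not None:      # clean copy arrived
            return delivered, attempt
        # Containment: the corrupted frame is dropped, never forwarded.
    raise RuntimeError("fail stop: retry limit reached")


def corrupt_first_attempt(framed: bytes, attempt: int) -> bytes:
    """Hypothetical channel that corrupts only the first transmission."""
    return flip_bit(framed, 5) if attempt == 1 else framed


payload, tries = send_with_retry(b"flit", corrupt_first_attempt)
```

Here the first copy fails the CRC check and is dropped, and the second copy is accepted, so `tries` ends up as 2.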

12 Example: Link-Level Error Control
[Figure: a sending router (retransmit control, Tx flit buffer) connected by a channel to a receiving router (error check, input unit)]
- Detection: CRC on channel
- Containment: drop packet with error
- Recovery: request retransmission and resume normal sequence
- How can this fail? How to fix it?

13 Link-Level Error Control (2)
Timeline (each row in time order, left to right):
  Tx Channel: Flit 1, Flit 2, Flit 3, Flit 4, Flit 5, Flit 6, Flit 2, Flit 3, Flit 4, Flit 5, Flit 6
  Rx Channel: Flit 1, Error, Flit 3, Flit 4, Flit 5, Flit 6, Flit 2, Flit 3, Flit 4, Flit 5
  Rx Ack:     Ack 1, Error 2, Ignore, Ignore, Ignore, Ignore, Ack 2, Ack 3, Ack 4
  Tx Ack:     Ack 1, Error 2, Ignore, Ignore, Ignore, Ignore, Ack 2, Ack 3
- Flit 2 was in error. Flits 2-6 are retransmitted
- Why would you want to retransmit flits 3-6?
- Pointers into the Tx flit buffer (holding Flit 1 through Flit 6):
  - Ack pointer: next flit to be ACKed
  - Tx pointer: next flit to be transmitted
  - Tail pointer: next free slot
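The retransmission scheme on this slide behaves like a go-back-N protocol driven by the three pointers. The simulation below is a sketch under assumptions that the slide does not state: one flit is sent per cycle, acknowledgements and error notices return after a fixed `round_trip` delay, and a flit is corrupted at most once. The function name and parameters are hypothetical.

```python
from collections import deque


def simulate_go_back_n(flits, corrupt_once, round_trip=5):
    """Return the flits driven onto the Tx channel, in order.

    flits        -- flit labels held in the Tx flit buffer
    corrupt_once -- labels whose first transmission arrives in error
    round_trip   -- cycles between sending a flit and seeing its reply
    """
    buffer = list(flits)
    ack_ptr = 0               # next flit to be ACKed
    tx_ptr = 0                # next flit to be transmitted
    tail_ptr = len(buffer)    # next free slot
    expected = 0              # receiver side: next flit it will accept
    already_corrupted = set()
    replies = deque()         # (arrival_cycle, kind, index), in time order
    sent = []
    cycle = 0

    while ack_ptr < tail_ptr:
        # Sender handles replies that have arrived by this cycle.
        while replies and replies[0][0] <= cycle:
            _, kind, index = replies.popleft()
            if kind == "ack":
                ack_ptr = index + 1      # slot can be reused
            else:
                tx_ptr = ack_ptr         # error: rewind to the bad flit

        if tx_ptr < tail_ptr:
            index = tx_ptr
            tx_ptr += 1
            label = buffer[index]
            sent.append(label)
            in_error = label in corrupt_once and label not in already_corrupted
            if index != expected:
                pass                     # receiver ignores out-of-order flits
            elif in_error:
                already_corrupted.add(label)
                replies.append((cycle + round_trip, "error", index))
            else:
                expected += 1
                replies.append((cycle + round_trip, "ack", index))
        cycle += 1
    return sent


trace = simulate_go_back_n([1, 2, 3, 4, 5, 6], corrupt_once={2})
# trace == [1, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6], matching the Tx Channel row
```

One plausible answer to the slide's question follows from the sketch: because the receiver simply ignores everything after the bad flit, it needs no reorder buffer, and flits are always delivered in order. The cost is the bandwidth spent resending flits 3-6.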

14 Channel Configuration
- Reconfigure channels with frequent errors
  - Swap in spare bits
  - Reduce width of channel
  - Reduce bit rate
- If malfunctions continue, decommission the channel
  - Assumes the routing algorithm will adapt

15 Cray BlackWidow Example
- Each channel is 3 bits wide at 6.25 Gb/s per bit (b = 18.75 Gb/s)
  - The 3 bits are serialized from a 24-bit flit
- Link-level retry rates are monitored
  - Each retry is attributed to one bit of the channel
  - If the retry rate exceeds a threshold, the bad bit is switched off
  - The channel degrades to two bits, then one bit, and then is switched off
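The degradation policy can be modelled with a small class. This is a hypothetical sketch rather than a description of the actual BlackWidow hardware: the class name, the use of a simple retry count instead of a retry rate, and the threshold value are all assumptions. Only the 3-bit width, the 6.25 Gb/s bit rate, and the 24-bit flit come from the slide.

```python
BIT_RATE_GBPS = 6.25   # per-bit signalling rate from the slide
FLIT_BITS = 24         # flit size from the slide


class DegradableChannel:
    """Channel that switches off individual bits blamed for retries."""

    def __init__(self, bits=3, retry_threshold=4):
        self.active = [True] * bits
        self.retries = [0] * bits
        self.retry_threshold = retry_threshold  # assumed value

    @property
    def width(self):
        """Number of bits still in service."""
        return sum(self.active)

    def bandwidth_gbps(self):
        return self.width * BIT_RATE_GBPS

    def bit_times_per_flit(self):
        """Bit times needed to serialize one flit, or None if switched off."""
        if self.width == 0:
            return None
        return -(-FLIT_BITS // self.width)  # ceiling division

    def record_retry(self, bit):
        """Attribute a link-level retry to one bit; disable it if too many."""
        if not self.active[bit]:
            return
        self.retries[bit] += 1
        if self.retries[bit] > self.retry_threshold:
            self.active[bit] = False


channel = DegradableChannel()
full_rate = channel.bandwidth_gbps()      # 18.75 Gb/s with all three bits
for _ in range(5):
    channel.record_retry(1)               # bit 1 keeps causing retries
degraded_rate = channel.bandwidth_gbps()  # 12.5 Gb/s on two bits
```

At full width a flit takes 8 bit times; on two bits it takes 12, and on one bit 24, so the channel keeps working at reduced bandwidth rather than failing outright.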

16 Router Error Control
- What would happen if:
  - A header bit in the input buffer flips
  - A credit count is corrupted
  - The router picks the wrong output
  - The selected output flips mid-packet
- Numerous failure modes inside the router
  - Many lead to catastrophic failure
    - Perhaps hundreds of cycles after the error occurred
  - Many others lead to insidious performance problems
    - E.g., losing credits

17 Router Error Control (2)
Same steps of Detect, Confine, Recover apply
- Detect
  - Parity or CRC on all storage and communication
  - Quick consistency checks (e.g., on allocators and credits)
  - Two copies of all other logic (in space or time)
- Confine
  - Stop propagating faulty packets
  - Operate via confinement regions (e.g., channel)
- Recover
  - Return to a known good state (sometimes via reset)
  - Resend faulted packets (if available)
  - Disable part of the router (fault-containment regions)
  - Replace part of the router (hot swapping)

18 Network-Level Error Control
- Model faulty routers and links as fail-stop components
- Use adaptive routing to avoid them
  - Table based: recompute tables periodically
  - Local adaptive: pick another minimal link (or non-minimal)
- Need to avoid dead ends and deadlocks
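For the table-based option, recomputing a routing table amounts to a shortest-path search that skips failed components. The sketch below is one simple way to do it, assuming bidirectional links and using a breadth-first search from the destination; the function name and the four-node example topology are hypothetical. A failed router can be modelled by marking all of its links as failed.

```python
from collections import deque


def next_hop_table(links, failed_links, dest):
    """Map each node that can still reach dest to its next hop.

    links        -- iterable of (a, b) bidirectional links
    failed_links -- set of links treated as fail-stop and removed
    dest         -- destination node the table routes toward
    """
    adjacency = {}
    for a, b in links:
        if (a, b) in failed_links or (b, a) in failed_links:
            continue
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    table = {}
    visited = {dest}
    queue = deque([dest])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                table[neighbor] = node   # first hop on a shortest path
                queue.append(neighbor)
    return table


# Hypothetical four-node ring: A-B, B-D, A-C, C-D.
ring = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D")]
healthy = next_hop_table(ring, set(), dest="D")
after_fault = next_hop_table(ring, {("B", "D")}, dest="D")
# after_fault == {"C": "D", "A": "C", "B": "A"}: B now routes the long way
```

Nodes cut off from the destination are simply absent from the table, which exposes dead ends. The sketch does not address deadlock: the recomputed routes can introduce channel dependency cycles that the original routing function avoided, so a real design needs an additional mechanism for that.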

19 End-To-End Error Control
- Keep a copy of each packet at the source until acknowledged or timeout
  - (This buffer can get large)
- If error detected
  - Drop packet
  - (Optionally) send a negative acknowledgement
- When packet correctly received
  - Send positive acknowledgement
- When acknowledgement received
  - Discard packet
- When negative acknowledgement received (or timeout)
  - Resend packet
  - May transmit the same packet multiple times
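The protocol on this slide can be sketched as a pair of endpoint classes. This is an illustrative model, not an implementation from the lecture: the class and method names are hypothetical, time is an abstract tick counter, the timeout value is assumed, and sequence numbers are added so the destination can filter the duplicates that arise because the same packet may be transmitted multiple times.

```python
class Source:
    """Keeps a copy of every packet until it is acknowledged."""

    def __init__(self, timeout=10):
        self.timeout = timeout   # assumed value, in abstract ticks
        self.pending = {}        # seq -> (packet, time last sent)
        self.next_seq = 0

    def send(self, packet, now):
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = (packet, now)
        return seq, packet

    def on_ack(self, seq):
        """Positive acknowledgement: discard the saved copy."""
        self.pending.pop(seq, None)

    def on_nack(self, seq, now):
        """Negative acknowledgement: resend immediately."""
        return self._resend(seq, now)

    def on_tick(self, now):
        """Resend every packet whose timeout has expired."""
        expired = [seq for seq, (_, sent_at) in self.pending.items()
                   if now - sent_at >= self.timeout]
        return [self._resend(seq, now) for seq in expired]

    def _resend(self, seq, now):
        if seq not in self.pending:
            return None          # already acknowledged
        packet, _ = self.pending[seq]
        self.pending[seq] = (packet, now)
        return seq, packet


class Destination:
    """Drops bad packets and delivers each good packet exactly once."""

    def __init__(self):
        self.delivered = {}      # seq -> packet

    def receive(self, seq, packet, ok):
        if not ok:
            return ("nack", seq)          # packet is dropped
        if seq not in self.delivered:     # ignore duplicates
            self.delivered[seq] = packet
        return ("ack", seq)


source, destination = Source(timeout=10), Destination()
seq, packet = source.send("p0", now=0)
reply = destination.receive(seq, packet, ok=False)   # ("nack", 0)
seq, packet = source.on_nack(seq, now=1)             # resend the saved copy
reply = destination.receive(seq, packet, ok=True)    # ("ack", 0)
source.on_ack(seq)                                   # copy discarded
```

The timeout path matters because a negative acknowledgement is optional and may itself be lost, or the error may have corrupted the header so that the destination cannot tell which packet to nack.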

20 Question of the day
- Consider a symmetric multiprocessing (SMP) network that does not allow packet loss and needs an availability of [value missing]
- Link BER is [value missing]
- Router components have a failure rate of 1000 FITs
- How best can you achieve this reliability requirement?

21 Summary
- Specification sets reliability requirements
  - Drop rate
  - Availability
- Failures are abstracted with fault models
  - Bit errors, soft errors, stuck-at, fail stop
- Detection, Containment, and Recovery
- Link level
  - Ack and retransmit
  - Reconfigure
- Router level
  - Detect all failures
  - Mask, retry, or reset
- Network level
  - Route around faulty components
- End-to-End
  - Retransmit on nack or timeout

416 Distributed Systems. Errors and Failures Feb 1, 2016

416 Distributed Systems. Errors and Failures Feb 1, 2016 416 Distributed Systems Errors and Failures Feb 1, 2016 Types of Errors Hard errors: The component is dead. Soft errors: A signal or bit is wrong, but it doesn t mean the component must be faulty Note:

More information

416 Distributed Systems. Errors and Failures Oct 16, 2018

416 Distributed Systems. Errors and Failures Oct 16, 2018 416 Distributed Systems Errors and Failures Oct 16, 2018 Types of Errors Hard errors: The component is dead. Soft errors: A signal or bit is wrong, but it doesn t mean the component must be faulty Note:

More information

Lecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background

Lecture 15: PCM, Networks. Today: PCM wrap-up, projects discussion, on-chip networks background Lecture 15: PCM, Networks Today: PCM wrap-up, projects discussion, on-chip networks background 1 Hard Error Tolerance in PCM PCM cells will eventually fail; important to cause gradual capacity degradation

More information

POWER4 Systems: Design for Reliability. Douglas Bossen, Joel Tendler, Kevin Reick IBM Server Group, Austin, TX

POWER4 Systems: Design for Reliability. Douglas Bossen, Joel Tendler, Kevin Reick IBM Server Group, Austin, TX Systems: Design for Reliability Douglas Bossen, Joel Tendler, Kevin Reick IBM Server Group, Austin, TX Microprocessor 2-way SMP system on a chip > 1 GHz processor frequency >1GHz Core Shared L2 >1GHz Core

More information

A SKY Computers White Paper

A SKY Computers White Paper A SKY Computers White Paper High Application Availability By: Steve Paavola, SKY Computers, Inc. 100000.000 10000.000 1000.000 100.000 10.000 1.000 99.0000% 99.9000% 99.9900% 99.9990% 99.9999% 0.100 0.010

More information

ECE 574 Cluster Computing Lecture 19

ECE 574 Cluster Computing Lecture 19 ECE 574 Cluster Computing Lecture 19 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 10 November 2015 Announcements Projects HW extended 1 MPI Review MPI is *not* shared memory

More information

No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6

No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6 Announcements No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6 Copyright c 2002 2017 UMaine School of Computing and Information S 1 / 33 COS 140:

More information

Announcements. No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6

Announcements. No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6 Announcements No book chapter for this topic! Slides are posted online as usual Homework: Will be posted online Due 12/6 Copyright c 2002 2017 UMaine Computer Science Department 1 / 33 1 COS 140: Foundations

More information

UNIT IV -- TRANSPORT LAYER

UNIT IV -- TRANSPORT LAYER UNIT IV -- TRANSPORT LAYER TABLE OF CONTENTS 4.1. Transport layer. 02 4.2. Reliable delivery service. 03 4.3. Congestion control. 05 4.4. Connection establishment.. 07 4.5. Flow control 09 4.6. Transmission

More information

Fault Tolerance. Distributed Systems. September 2002

Fault Tolerance. Distributed Systems. September 2002 Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend

More information

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability. Copyright 2010 Daniel J. Sorin Duke University

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability. Copyright 2010 Daniel J. Sorin Duke University Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2010 Daniel J. Sorin Duke University Definition and Motivation Outline General Principles of Available System Design

More information

Lecture 22: Fault Tolerance

Lecture 22: Fault Tolerance Lecture 22: Fault Tolerance Papers: Token Coherence: Decoupling Performance and Correctness, ISCA 03, Wisconsin A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures, HPCA 07, Spain Error

More information

Page 1. Review: Internet Protocol Stack. Transport Layer Services. Design Issue EEC173B/ECS152C. Review: TCP

Page 1. Review: Internet Protocol Stack. Transport Layer Services. Design Issue EEC173B/ECS152C. Review: TCP EEC7B/ECS5C Review: Internet Protocol Stack Review: TCP Application Telnet FTP HTTP Transport Network Link Physical bits on wire TCP LAN IP UDP Packet radio Transport Layer Services Design Issue Underlying

More information

Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan

Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan White paper Version: 1.1 Updated: Sep., 2017 Abstract: This white paper introduces Infortrend Intelligent

More information

ELEN Network Fundamentals Lecture 15

ELEN Network Fundamentals Lecture 15 ELEN 4017 Network Fundamentals Lecture 15 Purpose of lecture Chapter 3: Transport Layer Reliable data transfer Developing a reliable protocol Reliability implies: No data is corrupted (flipped bits) Data

More information

Outline. Parity-based ECC and Mechanism for Detecting and Correcting Soft Errors in On-Chip Communication. Outline

Outline. Parity-based ECC and Mechanism for Detecting and Correcting Soft Errors in On-Chip Communication. Outline Parity-based ECC and Mechanism for Detecting and Correcting Soft Errors in On-Chip Communication Khanh N. Dang and Xuan-Tu Tran Email: khanh.n.dang@vnu.edu.vn VNU Key Laboratory for Smart Integrated Systems

More information

Page 1. Review: Internet Protocol Stack. Transport Layer Services EEC173B/ECS152C. Review: TCP. Transport Layer: Connectionless Service

Page 1. Review: Internet Protocol Stack. Transport Layer Services EEC173B/ECS152C. Review: TCP. Transport Layer: Connectionless Service EEC7B/ECS5C Review: Internet Protocol Stack Review: TCP Application Telnet FTP HTTP Transport Network Link Physical bits on wire TCP LAN IP UDP Packet radio Do you remember the various mechanisms we have

More information

Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan

Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan Intelligent Drive Recovery (IDR): helping prevent media errors and disk failures with smart media scan White paper Version: 1.1 Updated: Oct., 2017 Abstract: This white paper introduces Infortrend Intelligent

More information

I. INTRODUCTION. each station (i.e., computer, telephone, etc.) directly connected to all other stations

I. INTRODUCTION. each station (i.e., computer, telephone, etc.) directly connected to all other stations I. INTRODUCTION (a) Network Topologies (i) point-to-point communication each station (i.e., computer, telephone, etc.) directly connected to all other stations (ii) switched networks (1) circuit switched

More information

Data Link Technology. Suguru Yamaguchi Nara Institute of Science and Technology Department of Information Science

Data Link Technology. Suguru Yamaguchi Nara Institute of Science and Technology Department of Information Science Data Link Technology Suguru Yamaguchi Nara Institute of Science and Technology Department of Information Science Agenda Functions of the data link layer Technologies concept and design error control flow

More information

Communication Networks

Communication Networks Communication Networks Prof. Laurent Vanbever Exercises week 4 Reliable Transport Reliable versus Unreliable Transport In the lecture, you have learned how a reliable transport protocol can be built on

More information

Fault Tolerant Computing CS 530

Fault Tolerant Computing CS 530 Fault Tolerant Computing CS 530 Lecture Notes 1 Introduction to the class Yashwant K. Malaiya Colorado State University 1 Instructor, TA Instructor: Yashwant K. Malaiya, Professor malaiya @ cs.colostate.edu

More information

Deadlock and Router Micro-Architecture

Deadlock and Router Micro-Architecture 1 EE482: Advanced Computer Organization Lecture #8 Interconnection Network Architecture and Design Stanford University 22 April 1999 Deadlock and Router Micro-Architecture Lecture #8: 22 April 1999 Lecturer:

More information

Data Link Control. Surasak Sanguanpong Last updated: 11 July 2000

Data Link Control. Surasak Sanguanpong  Last updated: 11 July 2000 1/14 Data Link Control Surasak Sanguanpong nguan@ku.ac.th http://www.cpe.ku.ac.th/~nguan Last updated: 11 July 2000 Flow Control 2/14 technique for controlling the data transmission so that s have sufficient

More information

Lecture 4: CRC & Reliable Transmission. Lecture 4 Overview. Checksum review. CRC toward a better EDC. Reliable Transmission

Lecture 4: CRC & Reliable Transmission. Lecture 4 Overview. Checksum review. CRC toward a better EDC. Reliable Transmission 1 Lecture 4: CRC & Reliable Transmission CSE 123: Computer Networks Chris Kanich Quiz 1: Tuesday July 5th Lecture 4: CRC & Reliable Transmission Lecture 4 Overview CRC toward a better EDC Reliable Transmission

More information

Wireless TCP. TCP mechanism. Wireless Internet: TCP in Wireless. Wireless TCP: transport layer

Wireless TCP. TCP mechanism. Wireless Internet: TCP in Wireless. Wireless TCP: transport layer Wireless TCP W.int.2-2 Wireless Internet: TCP in Wireless Module W.int.2 Mobile IP: layer, module W.int.1 Wireless TCP: layer Dr.M.Y.Wu@CSE Shanghai Jiaotong University Shanghai, China Dr.W.Shu@ECE University

More information

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics Lecture 16: On-Chip Networks Topics: Cache networks, NoC basics 1 Traditional Networks Huh et al. ICS 05, Beckmann MICRO 04 Example designs for contiguous L2 cache regions 2 Explorations for Optimality

More information

RELIABILITY and RELIABLE DESIGN. Giovanni De Micheli Centre Systèmes Intégrés

RELIABILITY and RELIABLE DESIGN. Giovanni De Micheli Centre Systèmes Intégrés RELIABILITY and RELIABLE DESIGN Giovanni Centre Systèmes Intégrés Outline Introduction to reliable design Design for reliability Component redundancy Communication redundancy Data encoding and error correction

More information

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following: CS 47 Spring 27 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online

More information

2. Software Generation of Advanced Error Reporting Messages

2. Software Generation of Advanced Error Reporting Messages 1. Introduction The PEX 8612 provides two mechanisms for error injection: Carter Buck, Sr. Applications Engineer, PLX Technology PCI Express Advanced Error Reporting Status register bits (which normally

More information

416 Distributed Systems. Errors and Failures, part 2 Feb 3, 2016

416 Distributed Systems. Errors and Failures, part 2 Feb 3, 2016 416 Distributed Systems Errors and Failures, part 2 Feb 3, 2016 Options in dealing with failure 1. Silently return the wrong answer. 2. Detect failure. 3. Correct / mask the failure 2 Block error detection/correction

More information

User Datagram Protocol

User Datagram Protocol Topics Transport Layer TCP s three-way handshake TCP s connection termination sequence TCP s TIME_WAIT state TCP and UDP buffering by the socket layer 2 Introduction UDP is a simple, unreliable datagram

More information

Lecture 5: Flow Control. CSE 123: Computer Networks Alex C. Snoeren

Lecture 5: Flow Control. CSE 123: Computer Networks Alex C. Snoeren Lecture 5: Flow Control CSE 123: Computer Networks Alex C. Snoeren Pipelined Transmission Sender Receiver Sender Receiver Ignored! Keep multiple packets in flight Allows sender to make efficient use of

More information

Aerospace Software Engineering

Aerospace Software Engineering 16.35 Aerospace Software Engineering Reliability, Availability, and Maintainability Software Fault Tolerance Prof. Kristina Lundqvist Dept. of Aero/Astro, MIT Definitions Software reliability The probability

More information

Fault Tolerance. Basic Concepts

Fault Tolerance. Basic Concepts COP 6611 Advanced Operating System Fault Tolerance Chi Zhang czhang@cs.fiu.edu Dependability Includes Availability Run time / total time Basic Concepts Reliability The length of uninterrupted run time

More information

Fault Tolerance. The Three universe model

Fault Tolerance. The Three universe model Fault Tolerance High performance systems must be fault-tolerant: they must be able to continue operating despite the failure of a limited subset of their hardware or software. They must also allow graceful

More information

EE 122: Error detection and reliable transmission. Ion Stoica September 16, 2002

EE 122: Error detection and reliable transmission. Ion Stoica September 16, 2002 EE 22: Error detection and reliable transmission Ion Stoica September 6, 2002 High Level View Goal: transmit correct information Problem: bits can get corrupted - Electrical interference, thermal noise

More information

Dep. Systems Requirements

Dep. Systems Requirements Dependable Systems Dep. Systems Requirements Availability the system is ready to be used immediately. A(t) = probability system is available for use at time t MTTF/(MTTF+MTTR) If MTTR can be kept small

More information

Data Link Control Protocols

Data Link Control Protocols Protocols : Introduction to Data Communications Sirindhorn International Institute of Technology Thammasat University Prepared by Steven Gordon on 23 May 2012 Y12S1L07, Steve/Courses/2012/s1/its323/lectures/datalink.tex,

More information

Dependability tree 1

Dependability tree 1 Dependability tree 1 Means for achieving dependability A combined use of methods can be applied as means for achieving dependability. These means can be classified into: 1. Fault Prevention techniques

More information

High Availability and Redundant Operation

High Availability and Redundant Operation This chapter describes the high availability and redundancy features of the Cisco ASR 9000 Series Routers. Features Overview, page 1 High Availability Router Operations, page 1 Power Supply Redundancy,

More information

CPE 448/548 Exam #1 (100 pts) February 14, Name Class: 448

CPE 448/548 Exam #1 (100 pts) February 14, Name Class: 448 Name Class: 448 1) (14 pts) A message M = 11001 is transmitted from node A to node B using the CRC code. The CRC generator polynomial is G(x) = x 3 + x 2 + 1 ( bit sequence 1101) a) What is the transmitted

More information

Lecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control

Lecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control Lecture 24: Interconnection Networks Topics: topologies, routing, deadlocks, flow control 1 Topology Examples Grid Torus Hypercube Criteria Bus Ring 2Dtorus 6-cube Fully connected Performance Bisection

More information

Lecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control

Lecture 12: Interconnection Networks. Topics: dimension/arity, routing, deadlock, flow control Lecture 12: Interconnection Networks Topics: dimension/arity, routing, deadlock, flow control 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees, butterflies,

More information

Dependability and ECC

Dependability and ECC ecture 38 Computer Science 61C Spring 2017 April 24th, 2017 Dependability and ECC 1 Great Idea #6: Dependability via Redundancy Applies to everything from data centers to memory Redundant data centers

More information

CS 43: Computer Networks. 16: Reliable Data Transfer October 8, 2018

CS 43: Computer Networks. 16: Reliable Data Transfer October 8, 2018 CS 43: Computer Networks 16: Reliable Data Transfer October 8, 2018 Reading Quiz Lecture 16 - Slide 2 Last class We are at the transport-layer protocol! provide services to the application layer interact

More information

Why Things Break -- With Examples From Autonomous Vehicles ,QVWLWXWH IRU &RPSOH[ (QJLQHHUHG 6\VWHPV

Why Things Break -- With Examples From Autonomous Vehicles ,QVWLWXWH IRU &RPSOH[ (QJLQHHUHG 6\VWHPV Why Things Break -- With Examples From Autonomous Vehicles Phil Koopman Department of Electrical & Computer Engineering & Institute for Complex Engineered Systems (based, in part, on material from Dan

More information

TCP Congestion Control

TCP Congestion Control TCP Congestion Control What is Congestion The number of packets transmitted on the network is greater than the capacity of the network Causes router buffers (finite size) to fill up packets start getting

More information

TCP Congestion Control

TCP Congestion Control What is Congestion TCP Congestion Control The number of packets transmitted on the network is greater than the capacity of the network Causes router buffers (finite size) to fill up packets start getting

More information

Lecture 7: Flow Control"

Lecture 7: Flow Control Lecture 7: Flow Control" CSE 123: Computer Networks Alex C. Snoeren No class Monday! Lecture 7 Overview" Flow control Go-back-N Sliding window 2 Stop-and-Wait Performance" Lousy performance if xmit 1 pkt

More information

Lecture 10: Link layer multicast. Mythili Vutukuru CS 653 Spring 2014 Feb 6, Thursday

Lecture 10: Link layer multicast. Mythili Vutukuru CS 653 Spring 2014 Feb 6, Thursday Lecture 10: Link layer multicast Mythili Vutukuru CS 653 Spring 2014 Feb 6, Thursday Unicast and broadcast Usually, link layer is used to send data over a single hop between source and destination. This

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

Lecture: Interconnection Networks. Topics: TM wrap-up, routing, deadlock, flow control, virtual channels

Lecture: Interconnection Networks. Topics: TM wrap-up, routing, deadlock, flow control, virtual channels Lecture: Interconnection Networks Topics: TM wrap-up, routing, deadlock, flow control, virtual channels 1 TM wrap-up Eager versioning: create a log of old values Handling problematic situations with a

More information

TCP: Flow and Error Control

TCP: Flow and Error Control 1 TCP: Flow and Error Control Required reading: Kurose 3.5.3, 3.5.4, 3.5.5 CSE 4213, Fall 2006 Instructor: N. Vlajic TCP Stream Delivery 2 TCP Stream Delivery unlike UDP, TCP is a stream-oriented protocol

More information

The flow of data must not be allowed to overwhelm the receiver

The flow of data must not be allowed to overwhelm the receiver Data Link Layer: Flow Control and Error Control Lecture8 Flow Control Flow and Error Control Flow control refers to a set of procedures used to restrict the amount of data that the sender can send before

More information

CS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 4pm

CS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 4pm CS144: Intro to Computer Networks Homework 1 Scan and submit your solution online. Due Friday January 30, 2015 @ 4pm Your Name: Answers SUNet ID: root @stanford.edu Check if you would like exam routed

More information

Principles of Reliable Data Transfer

Principles of Reliable Data Transfer Principles of Reliable Data Transfer 1 Reliable Delivery Making sure that the packets sent by the sender are correctly and reliably received by the receiver amid network errors, i.e., corrupted/lost packets

More information

EE 6900: FAULT-TOLERANT COMPUTING SYSTEMS

EE 6900: FAULT-TOLERANT COMPUTING SYSTEMS EE 6900: FAULT-TOLERANT COMPUTING SYSTEMS LECTURE 8: HARDWARE FAULT TOLERANCE TECHNIQUES Fall 2014 Avinash Kodi kodi@ohio.edu Acknowledgement: Daniel Sorin, Behrooz Parhami, Srinivasan Ramasubramanian

More information

The Transport Layer Reliability

The Transport Layer Reliability The Transport Layer Reliability CS 3, Lecture 7 http://www.cs.rutgers.edu/~sn4/3-s9 Srinivas Narayana (slides heavily adapted from text authors material) Quick recap: Transport Provide logical communication

More information

Distributed Systems COMP 212. Lecture 19 Othon Michail

Distributed Systems COMP 212. Lecture 19 Othon Michail Distributed Systems COMP 212 Lecture 19 Othon Michail Fault Tolerance 2/31 What is a Distributed System? 3/31 Distributed vs Single-machine Systems A key difference: partial failures One component fails

More information

CMSC 417. Computer Networks Prof. Ashok K Agrawala Ashok Agrawala. October 30, 2018

CMSC 417. Computer Networks Prof. Ashok K Agrawala Ashok Agrawala. October 30, 2018 CMSC 417 Computer Networks Prof. Ashok K Agrawala 2018 Ashok Agrawala October 30, 2018 Message, Segment, Packet, and Frame host host HTTP HTTP message HTTP TCP TCP segment TCP router router IP IP packet

More information

Outline: Connecting Many Computers

Outline: Connecting Many Computers Outline: Connecting Many Computers Last lecture: sending data between two computers This lecture: link-level network protocols (from last lecture) sending data among many computers 1 Review: A simple point-to-point

More information

416 Distributed Systems. Errors and Failures Feb 9, 2018

416 Distributed Systems. Errors and Failures Feb 9, 2018 416 Distributed Systems Errors and Failures Feb 9, 2018 Types of Errors Hard errors: The component is dead. Soft errors: A signal or bit is wrong, but it doesn t mean the component must be faulty Note:

More information

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Authors: Robert L Akamine, Robert F. Hodson, Brock J. LaMeres, and Robert E. Ray www.nasa.gov Contents Introduction to the

More information

Wireless TCP Performance Issues
