EE382C Lecture 14. Reliability and Error Control 5/17/11. EE 382C - S11 - Lecture 14 1



1 EE382C Lecture 14: Reliability and Error Control, 5/17/11

2 Announcements
- Don't forget to iterate with us for your checkpoint 1 report
- Send time slot preferences for checkpoint 2
- Project presentations next week
  - Let us know if you are OK with presenting on Tuesday, May 24th

3 Question of the day
Consider a symmetric multiprocessing (SMP) network that does not allow packet loss and needs an availability of
- Link BER is
- Router components have a failure rate of 1000 FITs
How best can you achieve this reliability requirement?

4 Reliability and Availability
- Reliability: R(t)
  - Probability that the system is working at time t, given that it was working at time t=0 and has had no failures in between
- Availability: A(t)
  - Probability that the system is working when needed, at a given point in time t
  - Often affected by the repair process
  - A ~ MTBF/(MTBF+MTTR)
- MTBF: mean time between failures
- FIT: failures in time, counted per 10^9 hours of operation; the inverse of MTBF with zero repair time
- MTTR: mean time to recovery
- RAS requirements: reliability, availability, and serviceability
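The relations on this slide can be tried out numerically. The sketch below assumes the usual convention that one FIT is one failure per 10^9 hours and that the FIT rates of components that must all work simply add; the 1-hour repair time and the function names are illustrative assumptions, not values from the lecture.

```python
HOURS_PER_FIT_UNIT = 1e9  # assumption: 1 FIT = 1 failure per 10^9 hours


def fit_to_mtbf_hours(fit: float) -> float:
    """Convert a failure rate in FITs to a mean time between failures in hours."""
    return HOURS_PER_FIT_UNIT / fit


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A ~ MTBF / (MTBF + MTTR), as on the slide."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_fit(component_fits) -> float:
    """Failure rate of a system that fails when any one component fails."""
    return sum(component_fits)


# A single component at 1000 FITs with a hypothetical 1-hour repair time.
mtbf = fit_to_mtbf_hours(1000.0)
a = availability(mtbf, mttr_hours=1.0)
print(f"MTBF = {mtbf:.0f} hours, A = {a:.7f}")
```

Because FIT rates add, a system built from many such components has a proportionally shorter MTBF, which is why repair time (MTTR) matters so much for availability.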

5 Examples of RAS Requirements
- Enterprise server
  - A =
  - System-level requirement
  - Can reflect to a network-level requirement, or detect and recover from network failures
  - In general, every packet must be correctly received or the system will fail
- Internet router
  - A =
  - But OK to drop packets (at rate of )
  - Turn failures into packet drops

6 RAS Requirements in Those Systems
- Dropping (reliability)
  - Allowed or not
  - Rate allowed (e.g., )
- Availability (A) to
- Serviceability (MTTR)

7 MTTF and MTTR
A = MTTF / (MTTF + MTTR)

8 Failure Modes and Fault Models
Failure Mode                      Model      Rate    Units
Gaussian noise on channel         Transient          BER
Alpha particle strike on memory   Soft       10^-9   SER
Alpha particle strike on logic    Transient          BER
Electromigration                  Stuck-at   1       MTBF
Connector corrosion               Stuck-at   10      MTBF
Operator removes module           Fail-stop  10^5    MTBF
Software failure                  Fail-stop  10^4    MTBF

9 An Analogy

10 The Bathtub Curve
[Figure: failure rate (FITs) versus time (hours), with infant mortality and wearout regions]

11 Detection, Containment, and Recovery
Three-step program for dealing with errors
1. Detection: discover the error
   - CRC codes on channels
   - Parity or ECC codes on memories
   - Self-checking logic
2. Containment: prevent the error from propagating further
   - Mask it
   - Drop the packet (and retry)
   - Fail stop
3. Recovery: resume normal service
   - Return to a known state
   - Resume sending traffic
   - Possibly resend faulted packet

12 Example: Link-Level Error Control
[Figure: sending router and receiving router; labels: retransmit control, error check, Tx flit buffer, channel, input unit]
- Detection: CRC on channel
- Containment: drop packet with error
- Recovery: request retransmission and resume normal sequence
How can this fail? How to fix it?
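The detection and containment steps can be sketched in a few lines. This example assumes a 32-bit CRC from Python's standard `zlib` module appended to each flit; the slide does not say which CRC the link uses, and `attach_crc` and `check_crc` are hypothetical helper names.

```python
import zlib


def attach_crc(payload: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_crc(frame: bytes):
    """Receiver side: return the payload if the CRC matches, else None (drop)."""
    payload, received = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != received:
        return None  # containment: the corrupted flit is not passed on
    return payload


frame = attach_crc(b"flit-2")
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]  # flip one bit on the channel

print(check_crc(frame))      # payload is delivered
print(check_crc(corrupted))  # error detected; a retransmission would be requested
```

Recovery is not shown here: on a `None` result the receiver would request retransmission from the sender.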

13 Link-Level Error Control (2)
Tx Channel: Flit 1, Flit 2, Flit 3, Flit 4, Flit 5, Flit 6, Flit 2, Flit 3, Flit 4, Flit 5, Flit 6
Rx Channel: Flit 1, Error, Flit 3, Flit 4, Flit 5, Flit 6, Flit 2, Flit 3, Flit 4, Flit 5
Rx Ack:     Ack 1, Error 2, Ignore, Ignore, Ignore, Ignore, Ack 2, Ack 3, Ack 4
Tx Ack:     Ack 1, Error 2, Ignore, Ignore, Ignore, Ignore, Ack 2, Ack 3
- Flit 2 was in error; flits 2-6 are retransmitted
- Why would you want to retransmit flits 3-6?
Pointers into the Tx flit buffer (Flit 1 ... Flit 6):
- Ack pointer: next flit to be ACKed
- Tx pointer: next flit to be transmitted
- Tail pointer: next free slot
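The three pointers can be modeled with a small sender-side buffer. This is a minimal sketch under simplifying assumptions: sequence numbers start at 0, the buffer is an unbounded list rather than a circular buffer, and slots are never reclaimed. `RetransmitBuffer` and `drain` are hypothetical names.

```python
class RetransmitBuffer:
    """Sender-side flit buffer with ack, tx, and tail pointers."""

    def __init__(self):
        self.flits = []  # flit with sequence number i is stored at flits[i]
        self.ack = 0     # next flit to be acknowledged
        self.tx = 0      # next flit to be transmitted

    @property
    def tail(self):
        """Next free slot."""
        return len(self.flits)

    def enqueue(self, flit):
        self.flits.append(flit)

    def send_next(self):
        """Return (seq, flit) for the next transmission, or None if idle."""
        if self.tx == self.tail:
            return None
        seq = self.tx
        self.tx += 1
        return seq, self.flits[seq]

    def on_ack(self, seq):
        # Acks arrive in order; anything else is ignored in this sketch.
        if seq == self.ack:
            self.ack += 1

    def on_error(self, seq):
        # Go back: rewind the tx pointer so the bad flit and all later ones resend.
        if self.ack <= seq < self.tx:
            self.tx = seq


def drain(buf):
    """Transmit until the tx pointer reaches the tail; return the flits sent."""
    sent = []
    while (item := buf.send_next()) is not None:
        sent.append(item[1])
    return sent


buf = RetransmitBuffer()
for i in range(1, 7):
    buf.enqueue(f"Flit {i}")
first_pass = drain(buf)   # Flit 1 .. Flit 6
buf.on_ack(0)             # Ack 1
buf.on_error(1)           # Error 2
second_pass = drain(buf)  # Flit 2 .. Flit 6 again
print(second_pass)
```

Rewinding the tx pointer, rather than resending only the bad flit, keeps flits in order at the receiver, which is one reason flits 3-6 are resent in the slide's timeline.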

14 Channel Configuration
- Reconfigure channels with frequent errors
  - Swap in spare bits
  - Reduce width of channel
  - Reduce bit rate
- If malfunctions continue, decommission channel
  - Assumes routing algorithm will adapt

15 Cray BlackWidow Example
- Each channel is 3 bits wide at 6.25 Gb/s per bit (b = 18.75 Gb/s)
  - 3 bits serialized from 24-bit flit
- Link-level retry rates monitored
  - Each retry attributed to one bit of the channel
- If the retry rate exceeds a threshold, the bad bit is switched off
  - Channel degrades to two bits, then one bit, then is switched off
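The degradation policy can be illustrated with a toy model. This is only a sketch, not the BlackWidow implementation: it counts retries per bit within one monitoring window instead of measuring a rate, and the threshold value and the class name `DegradableChannel` are assumptions.

```python
class DegradableChannel:
    """Toy model of a channel whose bit lanes can be switched off one by one."""

    def __init__(self, width_bits=3, retry_threshold=8):
        self.active = [True] * width_bits
        self.retries = [0] * width_bits
        self.retry_threshold = retry_threshold  # hypothetical value

    def record_retry(self, bit):
        """Attribute one link-level retry to a bit lane; disable it past the threshold."""
        if not self.active[bit]:
            return
        self.retries[bit] += 1
        if self.retries[bit] > self.retry_threshold:
            self.active[bit] = False

    def end_window(self):
        """Start a new monitoring window by clearing the retry counts."""
        self.retries = [0] * len(self.retries)

    @property
    def width(self):
        return sum(self.active)

    def bandwidth_gbps(self, per_bit_gbps=6.25):
        return self.width * per_bit_gbps


ch = DegradableChannel(retry_threshold=2)
print(ch.bandwidth_gbps())  # full width: 3 lanes
for _ in range(3):
    ch.record_retry(1)      # lane 1 keeps causing retries
print(ch.width, ch.bandwidth_gbps())
```

When `width` reaches zero the channel is effectively switched off, and the routing algorithm has to route around it.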

16 Router Error Control
What would happen if:
- A header bit in the input buffer flips
- A credit count is corrupted
- The router picks the wrong output
- The selected output flips mid-packet
Numerous failure modes inside the router
- Many lead to catastrophic failure
  - Perhaps hundreds of cycles after the error occurred
- Many others lead to insidious performance problems
  - E.g., losing credits

17 Router Error Control (2)
Same steps of Detect, Confine, Recover apply
- Detect
  - Parity or CRC on all storage and communication
  - Quick consistency checks (e.g., on allocators and credits)
  - Two copies of all other logic (in space or time)
- Confine
  - Stop propagating faulty packets
  - Operate via confinement regions (e.g., channel)
- Recover
  - Reset to known good state (sometimes via reset)
  - Resend faulted packets (if available)
  - Disable part of the router (fault-containment regions)
  - Replace part of the router (hot swapping)
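Two of the detection mechanisms can be sketched briefly: a parity bit on a stored word, and a credit consistency check. The credit check assumes standard credit-based flow control, where every downstream buffer slot is either credited to the upstream router, held by a flit in flight, occupied in the buffer, or represented by a credit in flight back. Both function names are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit for a non-negative integer word (1 if the count of 1s is odd)."""
    return bin(word).count("1") & 1


def credits_consistent(upstream_credits, downstream_occupancy,
                       flits_in_flight, credits_in_flight, buffer_depth):
    """Check that every buffer slot is accounted for exactly once."""
    total = (upstream_credits + downstream_occupancy
             + flits_in_flight + credits_in_flight)
    return total == buffer_depth


header = 0b1011_0010
stored_parity = parity_bit(header)
flipped = header ^ 0b0000_1000  # a single bit flips in the input buffer
print(parity_bit(flipped) != stored_parity)  # the flip is detected

# A lost credit shows up as a slot that nobody accounts for.
print(credits_consistent(3, 4, 1, 0, buffer_depth=8))
print(credits_consistent(2, 4, 1, 0, buffer_depth=8))
```

A single parity bit detects any odd number of flipped bits but cannot correct them, so recovery still depends on retry or reset.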

18 Network-Level Error Control
- Model faulty routers and links as fail-stop components
- Use adaptive routing to avoid them
  - Table-based: recompute tables periodically
  - Local adaptive: pick another minimal link (or non-minimal)
- Need to avoid dead ends and deadlocks
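Table-based recomputation can be sketched as a breadth-first search over the links that are still working. This is an illustrative assumption about how the tables might be rebuilt, not the method from the lecture; it finds shortest surviving paths but does nothing about deadlock avoidance. `compute_next_hops` and the 4-node ring are hypothetical.

```python
from collections import deque


def compute_next_hops(links, failed, dest):
    """Return {node: next hop toward dest} using only links not marked failed."""
    adj = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue  # fail-stop link: treat it as absent
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)

    next_hop = {}
    visited = {dest}
    queue = deque([dest])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                next_hop[nbr] = node  # nbr reaches dest by stepping to node
                queue.append(nbr)
    return next_hop


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
healthy = compute_next_hops(ring, failed=set(), dest=0)
degraded = compute_next_hops(ring, failed={(0, 1)}, dest=0)
print(healthy)
print(degraded)
```

With link (0, 1) failed, node 1 no longer has a minimal route to node 0 and is sent the long way around the ring. Nodes that are cut off entirely simply have no entry in the table, which is one way dead ends show up.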

19 End-to-End Error Control
- Keep a copy of each packet at the source until acknowledged or timeout
  - (This buffer can get large)
- If error detected
  - Drop packet
  - (Optionally) send a negative acknowledgement
- When packet correctly received
  - Send positive acknowledgement
- When acknowledgement received
  - Discard packet
- When negative acknowledgement received (or timeout)
  - Resend packet
  - May transmit the same packet multiple times
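The source side of this protocol can be sketched as a table of unacknowledged packets. This is a minimal sketch under assumptions: time is an integer tick supplied by the caller, `send` is a hypothetical callback that injects a packet into the network, and the timeout value is arbitrary.

```python
class EndToEndSource:
    """Keeps a copy of each packet until it is acknowledged."""

    def __init__(self, send, timeout=100):
        self.send = send        # hypothetical hook: send(seq, packet)
        self.timeout = timeout  # hypothetical value, in ticks
        self.pending = {}       # seq -> (packet, deadline)
        self.next_seq = 0

    def transmit(self, packet, now):
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = (packet, now + self.timeout)
        self.send(seq, packet)
        return seq

    def on_ack(self, seq):
        # Positive acknowledgement: the stored copy can be discarded.
        self.pending.pop(seq, None)

    def on_nack(self, seq, now):
        self._resend(seq, now)

    def on_tick(self, now):
        # Resend every packet whose deadline has passed.
        expired = [s for s, (_, deadline) in self.pending.items() if deadline <= now]
        for seq in expired:
            self._resend(seq, now)

    def _resend(self, seq, now):
        if seq in self.pending:
            packet, _ = self.pending[seq]
            self.pending[seq] = (packet, now + self.timeout)
            self.send(seq, packet)


sent = []
src = EndToEndSource(lambda seq, pkt: sent.append((seq, pkt)), timeout=10)
a = src.transmit("A", now=0)
b = src.transmit("B", now=0)
src.on_ack(a)        # A arrived; its copy is discarded
src.on_tick(now=10)  # B timed out; it is sent again
print(sent)
```

Because a timeout can fire while the original packet or its acknowledgement is merely delayed, the destination may see the same sequence number more than once and would need to discard duplicates.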

20 Question of the day
Consider a symmetric multiprocessing (SMP) network that does not allow packet loss and needs an availability of
- Link BER is
- Router components have a failure rate of 1000 FITs
How best can you achieve this reliability requirement?

21 Summary
- Specification sets reliability requirements
  - Drop rate
  - Availability
- Failures are abstracted with fault models
  - Bit errors, soft errors, stuck-at, fail stop
- Detection, Containment, and Recovery
- Link level
  - Ack and retransmit
  - Reconfigure
- Router level
  - Detect all failures
  - Mask, retry, or reset
- Network level
  - Route around faulty components
- End-to-end
  - Retransmit on nack or timeout
