Temporal Redundancy. Yashwant K. Malaiya 10/30/2018 FTC YKM


1 Temporal Redundancy Yashwant K. Malaiya

2 Murphy's Law
- Anything that can go wrong, will. (Actually not by Murphy but by Finagle.)
- To every law there is an exception.
CS530 laws: Anything that can go wrong eventually will, but
- it may not go wrong for a while
- it may not go wrong the next time
- only one thing may go wrong at a time

3 Temporal Redundancy
Effective for time-limited faults.
- Detection:
  - Serial transmission: parity/CRC
- Fault tolerance:
  - Bus errors: instruction retry
  - Databases: checkpoint & rollback
  - Bad/lost network packets: retransmission
- Requirement: save previous state(s)
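The detection step for serial transmission can be illustrated with a short sketch. It assumes a CRC-32 appended to each frame, using Python's standard `zlib.crc32`; the helper names `frame_with_crc` and `check_frame` are hypothetical, and the slide does not say which CRC polynomial or frame layout is intended.

```python
import zlib

def frame_with_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (assumed frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_frame(frame: bytes) -> bool:
    """Recompute the CRC at the receiver and compare with the one received."""
    payload, received = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received

frame = frame_with_crc(b"temporal redundancy")
# Simulate a transient fault on the line by flipping one bit of the first byte.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]

print(check_frame(frame))      # intact frame passes the check
print(check_frame(corrupted))  # damaged frame fails and would be retransmitted
```

Detection alone does not repair anything; the sender still has to keep a copy of the frame so it can be sent again, which is the "save previous state" requirement.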

4 Herodotus on the Persians (5th cent. BC)
"If an important decision is to be made, they [the Persians] discuss the question when they are drunk, and the following day the master of the house where the discussion was held submits their decision for reconsideration when they are sober. If they still approve it, it is adopted; if not, it is abandoned. Conversely, any decision they make when they are sober is reconsidered afterwards when they are drunk."

5 Temporal Redundancy
Effective for time-limited faults.
- Detection:
  - Serial transmission: parity/CRC
- Fault tolerance:
  - Bus errors: instruction retry
  - Databases: checkpoint & rollback
  - Bad/lost network packets: retransmission
- Requirement: save previous state(s) in stable storage

6 Terminology
- Check-pointing: saving part of the process state
  - Registers affected
  - Context
  - Part of the state (registers, memory) affected by the next process segment
  - Entire database, etc.
- Rollback: re-establishing a state of the process
- Audit trail: chronological record of all transactions
- Retry: re-execution after rollback (including audit-trail reprocessing)

7 Strategy
[Timeline: subtask i, checkpoint i, subtask i+1 hits an error (x), delay, rollback to checkpoint i, subtask i+1 retried, checkpoint i+1, subtask i+2]
Failed retry?
- Temporary fault still active when rollback done: do another rollback.
- Fault permanent: reconfigure hardware before rollback.
- Checkpoint i info bad: roll back to checkpoint i-1.
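The checkpoint, rollback, and retry cycle can be sketched as follows. This is a minimal illustration under stated assumptions, not the course's implementation: the process state is a plain dictionary, a checkpoint is a deep copy kept in memory rather than in stable storage, and `TransientFault`, `run_with_rollback`, `add_one`, and `flaky_double` are hypothetical names. Falling back to checkpoint i-1 and hardware reconfiguration are not modeled; a fault that survives all retries is simply reported as permanent.

```python
import copy

class TransientFault(Exception):
    """Hypothetical exception standing in for a detected error."""

def run_with_rollback(state, subtasks, max_retries=3):
    """Run subtasks in order, taking a checkpoint before each one."""
    for subtask in subtasks:
        checkpoint = copy.deepcopy(state)            # checkpoint i
        for _ in range(max_retries + 1):
            try:
                subtask(state)                       # subtask i+1
                break                                # success: go on to the next subtask
            except TransientFault:
                state.clear()                        # rollback: discard partial updates
                state.update(copy.deepcopy(checkpoint))
        else:
            raise RuntimeError("fault persists after retries; treat as permanent")
    return state

calls = {"n": 0}

def add_one(state):
    state["x"] += 1

def flaky_double(state):
    """Corrupts the state and fails on its first call, then works."""
    calls["n"] += 1
    original = state["x"]
    state["x"] = -999                                # partial update left behind by the fault
    if calls["n"] == 1:
        raise TransientFault("simulated transient fault")
    state["x"] = original * 2

result = run_with_rollback({"x": 0}, [add_one, flaky_double])
print(result)
```

The retry only succeeds because the rollback restores `x` before `flaky_double` runs again; without the saved checkpoint the second attempt would start from the corrupted value.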

8 Analysis of Overhead
Assumptions:
- Fault arrival (error) rate: $\lambda$; inter-checkpoint time: $T$
- No additional inputs/errors during retry time
Overhead per $T$:
$$O(T) = F + V(T)$$
where
- $F$: fixed overhead, the time to save/load checkpoint info
- $V(T)$: average variable overhead, the time lost due to checkpoint/rollback
$$V(T) = P\{\text{error during } T\} \cdot (\text{average retry time}) = \lambda T \left(F + \frac{kT}{2}\right)$$
where $k$ is the utilization factor.
Note: the retry time includes $F$, and $T/2$ is the average time from the last checkpoint to the error.
Justification? Why $T/2$?
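One possible answer to the question above, offered as a sketch: assume errors arrive as a Poisson process with rate $\lambda$. The slide states only the rate, so the Poisson model is an assumption here. Under it, the probability of an error in an interval is approximately $\lambda T$ when $\lambda T \ll 1$, and an error known to fall in $[0, T]$ is uniformly distributed over that interval, which gives the $T/2$ average.

```latex
P\{\text{at least one error in } T\} = 1 - e^{-\lambda T} \approx \lambda T
\qquad (\lambda T \ll 1)

E[\,t_{\text{error}} \mid \text{one error in } [0, T]\,]
  = \int_0^T \frac{t}{T}\, dt = \frac{T}{2}
```

The work that arrived during that $T/2$ came in at the transaction arrival rate and is redone at the processing rate, which would account for the factor $k$ multiplying $T/2$ in the retry time.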

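9 Analysis of Overhead (2)
Hence
$$O(T) = F + \lambda T \left(F + \frac{kT}{2}\right)$$
Fractional overhead:
$$\rho(T) = \frac{O(T)}{T} = \frac{F}{T} + \lambda F + \frac{\lambda k T}{2}$$
Minimum occurs at $d\rho/dT = 0$:
$$T_{opt} = \sqrt{\frac{2F}{\lambda k}}$$
Note: $k = \dfrac{\text{transaction arrival rate}}{\text{transaction processing rate}}$
[Plot: fractional overhead (fixed, average variable, total) versus inter-checkpoint time]
Ex: $\lambda = 0.01$, $k = 0.3$, $F = 10$ yields $T_{opt} = 81.6$.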
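The example on this slide can be checked numerically. The sketch below assumes the fractional overhead has the form $F/T + \lambda F + \lambda k T/2$, which is what setting $d\rho/dT = 0$ must start from to produce the stated optimum; the function names are hypothetical.

```python
import math

def fractional_overhead(T, F, lam, k):
    """rho(T) = O(T)/T with O(T) = F + lam*T*(F + k*T/2)."""
    return F / T + lam * F + lam * k * T / 2

def optimal_interval(F, lam, k):
    """Inter-checkpoint time that minimizes rho(T)."""
    return math.sqrt(2 * F / (lam * k))

# Values from the slide's example.
F, lam, k = 10, 0.01, 0.3
T_opt = optimal_interval(F, lam, k)
print(round(T_opt, 1))  # 81.6

# Sanity check: moving away from T_opt in either direction raises the overhead.
for T in (0.5 * T_opt, T_opt, 2 * T_opt):
    print(round(T, 1), fractional_overhead(T, F, lam, k))
```

Checkpointing more often than $T_{opt}$ pays the fixed cost $F$ too frequently, while checkpointing less often loses more work per error; the optimum is where the two $T$-dependent terms are equal.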

10 Networks: Packet Retransmission
- Information is divided into packets.
- Each packet contains the destination address, packet number, information slice, and CRC.
- Packets are routed through routers across the internet.
- At the destination, each packet is checked. Packets are assembled back and delivered to the destination.
- Should packets be small or large?
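The packet fields listed above can be sketched as a small data structure. This is an illustration only: the `Packet` class, the `packetize` and `reassemble` helpers, the address string, and the use of CRC-32 from `zlib` are all assumptions, not a real network protocol format.

```python
import zlib
from dataclasses import dataclass
from typing import List

@dataclass
class Packet:
    dest: str      # destination address
    seq: int       # packet number
    data: bytes    # information slice
    crc: int       # CRC-32 of the slice (assumed checksum)

def packetize(message: bytes, dest: str, size: int) -> List[Packet]:
    """Divide a message into packets carrying at most `size` bytes each."""
    packets = []
    for seq, offset in enumerate(range(0, len(message), size)):
        chunk = message[offset:offset + size]
        packets.append(Packet(dest, seq, chunk, zlib.crc32(chunk)))
    return packets

def reassemble(packets: List[Packet]) -> bytes:
    """Check every packet, then put the slices back in packet-number order."""
    for p in packets:
        if zlib.crc32(p.data) != p.crc:
            raise ValueError(f"packet {p.seq} is corrupted; request retransmission")
    return b"".join(p.data for p in sorted(packets, key=lambda p: p.seq))

message = b"Information is divided into packets"
packets = packetize(message, dest="node-B", size=8)
# Packets may arrive out of order; the packet number restores the sequence.
restored = reassemble(list(reversed(packets)))
print(restored == message)
```

The `size` parameter is where the small-versus-large question shows up: small packets spend proportionally more bytes on header fields, while large packets are more likely to be hit by an error and cost more to resend.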

11 Networks: Packet Retransmission
Stop-and-wait protocol: a packet is sent from node A to node B.
Possibilities:
1. Packet delivered correctly: B sends an acknowledgment
2. Packet delivered corrupted: no ack, or ack bad
3. Packet lost: no ack
In cases 2 and 3, A sends the packet again.
ARQ: automatic repeat request
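The three cases can be simulated with a short sketch. The assumptions are mine rather than the slide's: losses are independent random events, corrupted and lost frames are treated alike (no ACK arrives either way), the receiver uses the sequence number to ignore duplicates that result from a lost ACK, and the function name and probabilities are hypothetical.

```python
import random

def stop_and_wait(frames, p_frame_loss, p_ack_loss, rng, max_tries=100):
    """Send frames one at a time from A to B, resending until an ACK arrives."""
    received = []          # frames accepted by node B, in order
    transmissions = 0      # total sends by node A, including repeats
    for seq, frame in enumerate(frames):
        for _ in range(max_tries):
            transmissions += 1
            if rng.random() < p_frame_loss:
                continue                    # cases 2 and 3: no ACK, A sends again
            if seq == len(received):
                received.append(frame)      # case 1: B accepts the new frame
            if rng.random() >= p_ack_loss:
                break                       # ACK reached A: move to the next frame
            # ACK lost: A resends, and B ignores the duplicate via seq
        else:
            raise RuntimeError("no ACK after max_tries; link may be down")
    return received, transmissions

frames = [f"frame-{i}" for i in range(20)]
received, transmissions = stop_and_wait(frames, 0.3, 0.1, random.Random(0))
print(received == frames, transmissions)
```

Node A has to hold on to each frame until its ACK arrives, which is the same "save previous state" requirement that checkpointing imposes.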

12 Stop & Wait Utilization
When a frame is corrupted or lost (or the ACK is lost), it is retransmitted.
Error-free utilization of a link between two nodes (stop & wait):
$$U = \frac{1}{1 + \dfrac{2t_p}{t_f}}$$
where $t_p$: propagation time, $t_f$: frame duration.
If $p$ is P{a frame is lost}, then of $n$ frames sent, $n \cdot p$ are lost:
$$U_e = \frac{1 - p}{1 + \dfrac{2t_p}{t_f}}$$
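One way to see where the $1 - p$ factor comes from: if each transmission fails independently with probability $p$ (an assumption), a frame needs on average $1/(1-p)$ sends, and each send occupies the link for $t_f + 2t_p$. The sketch below encodes that reasoning; the function names are hypothetical, the timing values are borrowed from the example on the next slide, and $p = 0.1$ is an arbitrary illustration.

```python
def expected_transmissions(p):
    """Mean sends per frame when each send fails independently with probability p."""
    return 1.0 / (1.0 - p)

def stop_and_wait_utilization(t_f, t_p, p=0.0):
    """Fraction of link time spent on frames that are delivered successfully."""
    cycle = t_f + 2.0 * t_p    # send the frame, then wait a round trip for the ACK
    return t_f / (cycle * expected_transmissions(p))

u_free = stop_and_wait_utilization(t_f=307.0, t_p=25.0)
u_lossy = stop_and_wait_utilization(t_f=307.0, t_p=25.0, p=0.1)
print(u_free, u_lossy)
```

Algebraically, `t_f / (cycle / (1 - p))` reduces to $(1-p)/(1 + 2t_p/t_f)$, so the function agrees with the slide's expression for $U_e$ and with $U$ when $p = 0$.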

13 Stop and Wait Utilization (2)
If $p = c \cdot t_f$, then
$$U_e = \frac{1 - c\,t_f}{1 + \dfrac{2t_p}{t_f}}$$
Optimal frame size is then
$$t_{f,opt} = \frac{-4t_p + \sqrt{16t_p^2 + \dfrac{8t_p}{c}}}{2}$$
Higher performance ($U \to 1$) is obtained by sending multiple frames (sliding windows). A copy of all frames must be saved until acknowledged.
[Plot: $U_e(t_f)$ versus $t_f$]
Ex: $t_p = 25$, $c =$ … gives optimal $t_f = 307$. Unit: microseconds.
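The optimum can be reproduced numerically. The value of $c$ did not survive in the example above, so the sketch below uses $c = 4 \times 10^{-4}$ per microsecond purely as an assumption, chosen because it yields an optimum close to the stated 307; it should not be read as the slide's original figure. The function names are hypothetical.

```python
import math

def utilization(t_f, t_p, c):
    """U_e for stop and wait when the loss probability grows as p = c * t_f."""
    return (1 - c * t_f) / (1 + 2 * t_p / t_f)

def optimal_frame_time(t_p, c):
    """Positive root of t_f**2 + 4*t_p*t_f - 2*t_p/c = 0, where dU_e/dt_f = 0."""
    return (-4 * t_p + math.sqrt(16 * t_p**2 + 8 * t_p / c)) / 2

t_p = 25.0     # propagation time in microseconds, from the slide
c = 4e-4       # ASSUMED value; the original figure is missing from the slide text
t_opt = optimal_frame_time(t_p, c)
print(round(t_opt))

# Sanity check: utilization falls off on both sides of the optimum.
for t_f in (0.5 * t_opt, t_opt, 2 * t_opt):
    print(round(t_f), utilization(t_f, t_p, c))
```

Short frames waste the link waiting for acknowledgments, and long frames are lost more often, so utilization peaks at an intermediate frame duration.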

14 Undo
- Undo recovers from human mistakes (or bad decisions).
- Requires recording a trail and saving any deleted/overwritten information. This may take significant memory.
- Some operations cannot be undone.
- Multilevel undo: last-in/first-out
- Selectable level
- Redo
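Multilevel undo with redo can be sketched with two last-in/first-out stacks. This is a minimal illustration, assuming the whole value is saved on every change; the `UndoStack` class is hypothetical, and real editors often store smaller deltas to limit the memory cost mentioned above.

```python
class UndoStack:
    """Holds a value plus LIFO histories for undo and redo."""

    def __init__(self, value):
        self.value = value
        self._undo = []    # overwritten values, most recent last
        self._redo = []    # undone values, most recent last

    def set(self, new_value):
        self._undo.append(self.value)   # save the overwritten information
        self._redo.clear()              # a fresh change invalidates the redo trail
        self.value = new_value

    def undo(self, levels=1):
        """Step back up to `levels` changes (selectable level)."""
        for _ in range(min(levels, len(self._undo))):
            self._redo.append(self.value)
            self.value = self._undo.pop()

    def redo(self, levels=1):
        """Reapply up to `levels` undone changes."""
        for _ in range(min(levels, len(self._redo))):
            self._undo.append(self.value)
            self.value = self._redo.pop()

doc = UndoStack("a")
doc.set("ab")
doc.set("abc")
doc.undo(levels=2)
after_undo = doc.value     # back to "a"
doc.redo()
after_redo = doc.value     # forward to "ab"
print(after_undo, after_redo)
```

The undo history grows with every change, which is the memory cost the slide warns about; capping the stack depth is one common way to bound it.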

15 References
Chandy, K.M., "A Survey of Analytic Models of Rollback and Recovery Strategies," Computer, vol. 8, no. 5, May 1975.
