Data diverse software fault tolerance techniques
- Complement design diversity by compensating for its limitations
- Involve obtaining a related set of points in the program data space, executing the same software on those points, and then using a decision algorithm to determine the resulting output
- Use data re-expression algorithms (DRAs) to obtain their input data

Data diverse software fault tolerance techniques
- Retry blocks (RtB)
- N-copy programming (NCP)
- Two-pass adjudicators (TPA)

Retry blocks (RtB)
- Categorized as a dynamic technique
- Data diverse complement of recovery blocks (RcB)
- Uses acceptance tests and backward recovery to accomplish fault tolerance
- Typically uses one DRA, one algorithm, and a watchdog timer (WDT)

RtB operation
    ensure   Acceptance Test
    by       Primary Algorithm (Original Input)
    else by  Primary Algorithm (Re-expressed Input)
    else by  Primary Algorithm (Re-expressed Input)
    [Deadline Expires]
    else by  Backup Algorithm (Original Input)
    else     failure exception

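RtB structure and operation
[Flowchart: RtB entry, establish checkpoint, execute algorithm, evaluate AT. If the AT passes, discard the checkpoint and exit the RtB. If the AT fails or an exception is signaled, check whether a new DRA exists and the deadline has not expired. If yes, restore the checkpoint and execute the algorithm again. If no, restore the checkpoint, invoke the backup, and evaluate the AT for the backup: on pass, discard the checkpoint and exit the RtB; on fail, raise a failure exception.]
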
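The control flow above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the names `retry_block`, `RetryBlockFailure`, `accept`, `accept_backup`, and `wait_s` are hypothetical, the checkpoint is modeled as a deep copy of the input, and the deadline is only checked between attempts, so unlike a real watchdog timer it cannot interrupt a primary that hangs.

```python
import copy
import time


class RetryBlockFailure(Exception):
    """Raised when neither the primary nor the backup yields an acceptable result."""


def retry_block(primary, backup, dras, accept, accept_backup, x, wait_s=1.0):
    """Run `primary` on x, then on re-expressed x, then fall back to `backup`."""
    checkpoint = copy.deepcopy(x)                # establish checkpoint
    deadline = time.monotonic() + wait_s         # stands in for the WDT set to WP
    attempts = [lambda v: v] + list(dras)        # original input first, then each DRA
    for re_express in attempts:
        if time.monotonic() > deadline:          # deadline expired: go to backup
            break
        try:
            # "Restore checkpoint": every attempt starts from a fresh copy.
            result = primary(re_express(copy.deepcopy(checkpoint)))
        except Exception:
            continue                             # exception signal: try the next DRA
        if accept(result):                       # evaluate AT
            return result                        # checkpoint discarded on return
    try:
        result = backup(copy.deepcopy(checkpoint))   # backup uses the original input
    except Exception as exc:
        raise RetryBlockFailure("backup raised an exception") from exc
    if accept_backup(result):                    # evaluate ATB
        return result
    raise RetryBlockFailure("backup result failed the ATB")
```

A call such as `retry_block(p, b, [dra1, dra2], at, atb, x)` tries `p(x)`, then `p(dra1(x))`, then `p(dra2(x))`, and only then `b(x)`.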
Abbreviations
- AT: Acceptance test
- ATB: Acceptance test for the backup (may be the same as AT)
- B: Backup algorithm
- DRA_i: Data re-expression algorithm i (when there are multiple DRAs), or the i-th re-expression of the input (if a single DRA is used)
- P: Primary algorithm
- RtB: Retry block
- WDT: Watchdog timer
- WP: Expected maximum wait time for an acceptable result from P

Failure-free operation
- Upon entry to the RtB, the executive performs the following: a checkpoint (or recovery point) is established, a call to P is formatted, and the WDT is set to WP
- P is executed. No exception or time-out occurs during execution of P
- The results of P are submitted to the AT
- P's results are on time and pass the AT
- Control returns to the executive
- The executive discards the checkpoint, clears the WDT, the results are passed outside the RtB, and the RtB is exited

Exception in primary algorithm
- Upon entry to the RtB, the executive performs the following: a checkpoint (or recovery point) is established, a call to P is formatted, and the WDT is set to WP
- P is executed. An exception occurs during execution of P
- Control returns to the executive. The executive checks to ensure the deadline for acceptable results has not expired and checks whether there is a DRA option available that has not been attempted on this input
- The executive restores the checkpoint, then calls the DRA with the original input data as its argument
- The executive formats a call to P using the re-expressed input
- P is executed. No exception or time-out occurs during execution of P with the re-expressed input
- The results of P are submitted to the AT
- P's results are on time and pass the AT
- Control returns to the executive
- The executive discards the checkpoint, clears the WDT, the results are passed outside the RtB, and the RtB is exited

Primary's results are on time, but fail AT; successful execution with re-expressed inputs
- Upon entry to the RtB, the executive performs the following: a checkpoint (or recovery point) is established, a call to P is formatted, and the WDT is set to WP
- P is executed. No exception or time-out occurs during execution of P with the original input
- The results of P are submitted to the AT
- P's results fail the AT
- Control returns to the executive. The executive checks to ensure the deadline for acceptable results has not expired and checks whether there is a DRA option available that has not been attempted on this input
- The executive restores the checkpoint, then calls the DRA with the original input data as its argument
- The executive formats a call to P using the re-expressed input
- P is executed. No exception or time-out occurs during execution of P with the re-expressed input
- The results of P are submitted to the AT
- P's results are on time and pass the AT
- Control returns to the executive
- The executive discards the checkpoint, clears the WDT, the results are passed outside the RtB, and the RtB is exited

All data re-expression algorithms are used without success; successful backup execution
- P's results fail the AT
- Control returns to the executive. The executive checks to ensure the deadline for acceptable results has not expired and checks whether there is a DRA option available that has not been attempted on this input
- The executive restores the checkpoint, then calls DRA 2 with the original input data as its argument
- The executive formats a call to P using the re-expressed input
- P is executed. No exception or time-out occurs during execution of P with the re-expressed input
- The results of P are submitted to the AT
- P's results are on time but fail the AT
- Control returns to the executive. The executive checks to ensure the deadline for acceptable results has not expired and checks whether there is a DRA option available that has not been attempted on this input. No untried DRA remains
- The executive restores the checkpoint, formats a call to the backup, B, using the original inputs, and invokes B
- B is executed. No exception occurs during execution of B
- The results of B are submitted to the ATB
- B's results are on time and pass the ATB
- Control returns to the executive
- The executive discards the checkpoint, clears the WDT, the results are passed outside the RtB, and the RtB is exited

All data re-expression algorithms are used without success; backup executes, but fails ATB
- Control returns to the executive. The executive checks to ensure the deadline for acceptable results has not expired and checks whether there is a DRA option available that has not been attempted on this input
- The executive restores the checkpoint, formats a call to the backup, B, using the original inputs, and invokes B
- B is executed. No exception occurs during execution of B
- The results of B are submitted to the ATB
- B's results are on time, but fail the ATB
- Control returns to the executive
- The executive discards the checkpoint and clears the WDT; a failure exception is raised, and the RtB is exited

Augmentations to RtB operation
- Use a DRA execution counter
  - Used when the primary fails on the original input and primary execution is attempted with re-expressed inputs
  - Indicates the maximum number of times to execute the primary with different re-expressed inputs
  - Provides a means of imposing a deadline without using a timer
- Use a more detailed AT composed of several tests

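A DRA execution counter can be sketched as a bounded loop. The sketch below assumes a single multiuse DRA whose i-th re-expression is obtained by calling `dra(x, i)`; that signature, and the names `retry_with_counter`, `RetriesExhausted`, and `max_retries`, are hypothetical. The backup step is left to the caller.

```python
class RetriesExhausted(Exception):
    """Raised when the counter runs out before the primary passes the AT."""


def retry_with_counter(primary, dra, accept, x, max_retries=3):
    """Execute the primary at most 1 + max_retries times; no timer is involved."""
    for attempt in range(max_retries + 1):
        # Attempt 0 uses the original input; attempt i uses the i-th re-expression.
        y = x if attempt == 0 else dra(x, attempt)
        try:
            result = primary(y)
        except Exception:
            continue                      # treat an exception like an AT failure
        if accept(result):
            return result, attempt        # also report how many re-expressions were used
    raise RetriesExhausted(f"no acceptable result after {max_retries} re-expressions")
```

The counter bounds the worst-case number of primary executions, which is why it can stand in for a deadline when each execution has a known time bound.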
Multiuse single vs. multiple data re-expression algorithms
[Figure: the 1st, 2nd, ..., n-th use of a DRA during execution within an RtB block. With a single multiuse DRA, the same DRA is applied to x on every use, yielding DRA(x)_1, DRA(x)_2, ..., DRA(x)_n, where DRA(x)_j ≠ DRA(x)_k for j ≠ k. With multiple DRAs, a different algorithm DRA_1, DRA_2, ..., DRA_n is applied to x on each use.]

RtB example
- x and y are measured by sensors with a tolerance of ±0.02
- The original program should not receive an input of x = 0.0 because of the nature of the algorithm (our assumption)
[Figure: x-y plane showing the potential divide-by-zero error domain]

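RtB example
[Figure: RtB execution for the input (1e-10, 2.2)]
- Checkpoint is established; the input is (1e-10, 2.2)
- Primary algorithm f(x, y): divide-by-zero error using the original inputs
- Restore checkpoint
- DRA 1: R1(x) = x + 0.0021, giving the re-expressed input (1e-10 + 0.0021, 2.2)
- Primary algorithm f(x, y) = 123.45 using the re-expressed inputs
- AT: f(x, y) is checked against 100.0; 123.45 passes
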
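The example can be traced in code. The slides do not give f(x, y), so the `primary` below is a hypothetical stand-in (y / x with an assumed failure region around x = 0) and will not reproduce the value 123.45. The form of the acceptance test (result at least 100.0) is also an assumption; only the DRA, R1(x) = x + 0.0021, is taken from the example.

```python
def primary(x, y):
    # Hypothetical stand-in for f(x, y): fails when x is too close to zero.
    if abs(x) < 1e-6:
        raise ZeroDivisionError("x lies in the assumed failure region")
    return y / x


def dra_1(x):
    # R1(x) = x + 0.0021, well inside the sensor tolerance of 0.02.
    return x + 0.0021


def acceptance_test(result):
    # Assumed form of the AT: the result must be at least 100.0.
    return result >= 100.0


def run_example(x, y):
    checkpoint = (x, y)                       # establish checkpoint
    try:
        result = primary(*checkpoint)         # first try the original inputs
        if acceptance_test(result):
            return result, "original"
    except ZeroDivisionError:
        pass                                  # exception signal from the primary
    x0, y0 = checkpoint                       # restore checkpoint
    result = primary(dra_1(x0), y0)           # retry with re-expressed x
    if acceptance_test(result):
        return result, "re-expressed"
    raise RuntimeError("failure exception")   # the backup is omitted in this sketch
```

Calling `run_example(1e-10, 2.2)` follows the path in the figure: the original input raises, the checkpoint is restored, and the re-expressed input produces a result that passes the assumed AT.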
RtB issues and discussion
- (-) Time overhead
- (-) Service is interrupted during recovery
- (+) It is naturally applicable to software modules, as opposed to whole systems (same as RcBs)
- The success of data diverse software fault tolerance techniques depends on the performance of the re-expression algorithm used
  - Selection of the DRA is actually more important than selection of the approach!
- (-) DRAs are very application dependent
  - They are not easy to automate!
- (-) Can suffer from the domino effect (hope you remember it?)

N-copy programming (NCP)
- Categorized as a static technique
- Data diverse complement of N-version programming (NVP)
- Uses a decision mechanism (DM) and forward recovery to accomplish fault tolerance

NCP operation
    run DRA 1, DRA 2, ..., DRA n
    run Copy 1 (result of DRA 1), Copy 2 (result of DRA 2), ..., Copy n (result of DRA n)
    if (Decision Mechanism (Result 1, Result 2, ..., Result n))
        return Result
    else failure exception

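NCP structure and operation
[Flowchart: NCP entry, distribute inputs to DRA 1, DRA 2, ..., DRA n, which feed Copy 1, Copy 2, ..., Copy n. Results are gathered and passed to the DM. If an output is selected, NCP exits; if an exception is raised, a failure exception results.]
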
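A minimal sketch of this structure with a basic exact majority voter follows. The names `n_copy`, `exact_majority`, and `NCPFailure` are hypothetical. The copies run sequentially here for simplicity, whereas NCP copies would normally execute concurrently, and results are assumed to be hashable so they can be counted.

```python
from collections import Counter


class NCPFailure(Exception):
    """Raised when the decision mechanism cannot select a result."""


def exact_majority(results, n):
    """Basic exact majority voter: needs all n results and a strict majority."""
    if len(results) < n:
        raise NCPFailure(f"expected {n} results, received {len(results)}")
    value, count = Counter(results).most_common(1)[0]
    if count > n // 2:
        return value
    raise NCPFailure("no majority among the copy results")


def n_copy(program, dras, x):
    """Run the same program on n re-expressions of x and vote on the results."""
    inputs = [dra(x) for dra in dras]         # y_i = DRA_i(x)
    results = []
    for y in inputs:                          # each copy executes on its own input
        try:
            results.append(program(y))
        except Exception:
            pass                              # this copy contributes no result
    return exact_majority(results, len(dras))
```

With exact DRAs such as reordering a list before summing it, all copies should agree, for example `n_copy(sum, [list, lambda v: sorted(v), lambda v: v[::-1]], [3, 1, 2])`.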
Abbreviations
- C_i: Copy i
- DM: Decision mechanism
- DRA_i: Data re-expression algorithm i
- n: The number of copies
- NCP: N-copy programming
- R_i: Results of C_i
- x: Original input
- y_i: Re-expressed input, y_i = DRA_i(x)

Failure-free operation
- Upon entry to NCP, the executive sends the input, x, to the n DRAs to be re-expressed
- The DRAs run their re-expression algorithms on x, yielding the re-expressed inputs y_i = DRA_i(x)
- The executive gathers the re-expressed inputs, formats calls to the n copies, and through those calls distributes the re-expressed inputs to the copies
- Each copy, C_i, executes. No failures occur during their execution
- The results of the copy executions (R_i) are gathered by the executive and submitted to the exact majority DM
- The R_i are equal to one another, so the DM selects R_2 (randomly, since the results are equal) as the correct result
- Control returns to the executive
- The executive passes the correct result outside the NCP, and the NCP module is exited

Failure scenario: incorrect results
- Upon entry to NCP, the executive sends the input, x, to the n DRAs to be re-expressed
- The DRAs run their re-expression algorithms on x, yielding the re-expressed inputs y_i = DRA_i(x)
- The executive gathers the re-expressed inputs, formats calls to the n copies, and through those calls distributes the re-expressed inputs to the copies
- Each copy, C_i, executes
- The results of the copy executions (R_i) are gathered by the executive and submitted to the exact majority DM
- None of the R_i are equal. The DM cannot determine a correct result, and it sets a flag indicating this fact
- Control returns to the executive
- The executive raises an exception and the NCP module is exited

Failure scenario: copy does not execute
- Upon entry to NCP, the executive sends the input, x, to the n DRAs to be re-expressed
- The DRAs run their re-expression algorithms on x, yielding the re-expressed inputs y_i = DRA_i(x)
- The executive gathers the re-expressed inputs, formats calls to the n copies, and through those calls distributes the re-expressed inputs to the copies
- The copies, C_i, begin execution. One or more copies do not complete execution for some reason (e.g., stuck in an endless loop)
- The executive cannot retrieve all copy results in a timely manner. The executive submits the results it does have to the DM
- The DM expects n results, but receives n-1 (or n-2, etc.) results. The basic exact majority voter cannot handle fewer than n results and sets a flag indicating its failure to select a correct result
- Control returns to the executive
- The executive raises an exception and the NCP module is exited

Augmentations to NCP operation
- Using a different DM than the basic majority voter (such as?)
- Voting on the results as each copy completes execution, as opposed to waiting for all copies to complete
- Combination with other techniques

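Voting as each copy completes can be sketched with Python's standard `concurrent.futures` module. The function name `vote_as_completed` is hypothetical, results are assumed to be hashable, and `None` is used to signal that no majority emerged. One caveat of this simple version: the decision is reached as soon as a majority exists, but leaving the `with` block still waits for the remaining copies to finish, so a hung copy would need a separate timeout mechanism.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed


def vote_as_completed(program, dras, x):
    """Tally copy results as they arrive; decide once a strict majority agrees."""
    n = len(dras)
    needed = n // 2 + 1                       # strict majority of n copies
    tally = Counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Each copy runs the same program on its own re-expressed input.
        futures = [pool.submit(program, dra(x)) for dra in dras]
        for future in as_completed(futures):
            try:
                result = future.result()
            except Exception:
                continue                      # a failed copy casts no vote
            tally[result] += 1
            if tally[result] >= needed:
                return result                 # majority reached before all copies voted
    return None                               # no majority among completed copies
```

This design trades the simplicity of the basic voter for lower latency in the common failure-free case, where the first majority of agreeing copies settles the outcome.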
NCP example
- x and y are measured by sensors with a tolerance of ±0.02
- The original program should not receive an input of x = 0.0 because of the nature of the algorithm
[Figure: x-y plane showing the potential divide-by-zero error domain]

NCP example
[Figure: NCP data flow for the input (1e-10, 2.2)]
- Distribute inputs: (1e-10, 2.2)
- DRA 1: pass-through, R1(x) = x, giving (1e-10, 2.2)
- DRA 2: R2(x) = x + 0.002, giving (0.002 + 1e-10, 2.2)
- DRA 3: R3(x) = x + 0.001, giving (0.001 + 1e-10, 2.2)
- Copy 1: f(x, y) raises a divide-by-zero error (no result, ø)
- Copy 2: f(x, y) = 123.45
- Copy 3: f(x, y) = 123.96
- DM: majority voter with tolerance Δ = 0.75; 123.96 - 123.45 = 0.51 < 0.75, so the selected result is 123.45

NCP example
- Upon entry to NCP, the executive sends the input, (1e-10, 2.2), to the three DRAs to be re-expressed
- The DRAs run their re-expression algorithms on the input, yielding the following re-expressed inputs:
  - DRA 1 (1e-10, 2.2) = (1e-10, 2.2) (pass-through DRA)
  - DRA 2 (1e-10, 2.2) = (0.002 + 1e-10, 2.2)
  - DRA 3 (1e-10, 2.2) = (0.001 + 1e-10, 2.2)
- The executive gathers the re-expressed inputs, formats calls to the n = 3 copies, and through those calls distributes the re-expressed inputs to the copies
- Each copy, C_i (i = 1, 2, 3), executes
- The results of the copy executions (r_i, i = 1, ..., n) are gathered by the executive and submitted to the DM
- The DM examines the results. The adjudicated result is 123.45 (randomly selected from those copy results matching within the tolerance)
- Control returns to the executive
- The executive passes the correct result, 123.45, outside the NCP, and the NCP module is exited

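The adjudication step of this example can be sketched as a tolerance majority voter. The function name `tolerance_majority` is hypothetical, a missing copy result is represented by `None`, and the sketch deterministically returns the first member of the largest agreeing group where the example selects randomly. The grouping is anchor-based (every member lies within Δ of one anchor value), which is a simplification of the voters used in practice.

```python
def tolerance_majority(results, n, delta):
    """Return a result shared (within delta) by a strict majority of n copies, else None."""
    present = [r for r in results if r is not None]   # drop copies with no result
    best_group = []
    for anchor in present:
        # Collect every result that matches this anchor within the tolerance.
        group = [r for r in present if abs(r - anchor) <= delta]
        if len(group) > len(best_group):
            best_group = group
    if len(best_group) > n // 2:
        return best_group[0]
    return None


# Copy 1 failed with a divide-by-zero error, copies 2 and 3 returned values.
selected = tolerance_majority([None, 123.45, 123.96], n=3, delta=0.75)
```

Here copies 2 and 3 differ by 0.51, which is within Δ = 0.75, so two of the three copies agree and `selected` is 123.45, matching the example.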
NCP issues and discussion
- The fault tolerance of a system employing data diversity depends upon the ability of the DRA to produce data points outside a failure region, given an initial data point that is within a failure region
- One way to improve the performance of NCP is to use DMs that are appropriate for the problem solution domain (e.g., consensus voting)

Two-pass adjudicators (TPA)
- Combination of data and design diverse software fault tolerance techniques
- Also a combination of static and dynamic techniques
- The hardware fault tolerance architecture related to the technique is N-modular redundancy
- Uses a DM, and both forward and backward recovery
- Operates like NVP unless and until the DM cannot determine a correct result given the variant results

TPA operation
    Pass 1:
        run Variant 1 (original input), Variant 2 (original input), ..., Variant n (original input)
        if (Decision Mechanism (Result(Pass 1, Variant 1), Result(Pass 1, Variant 2), ..., Result(Pass 1, Variant n)))
            return Result
        else
    Pass 2:
        run DRA 1, DRA 2, ..., DRA n
        run Variant 1 (result of DRA 1), Variant 2 (result of DRA 2), ..., Variant n (result of DRA n)
        if (Decision Mechanism (Result(Pass 2, Variant 1), Result(Pass 2, Variant 2), ..., Result(Pass 2, Variant n)))
            return Result
        else failure exception

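The two passes can be sketched as follows. The names `two_pass_adjudicator`, `majority`, `run_variants`, and `TPAFailure` are hypothetical, the voter is a plain exact majority voter, variants run sequentially, and `None` is assumed never to be a valid variant result. Any post-execution adjustment of results is omitted.

```python
from collections import Counter


class TPAFailure(Exception):
    """Raised when neither pass lets the decision mechanism select a result."""


def majority(results, n):
    """Exact majority over hashable results; None if no strict majority exists."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > n // 2 else None


def run_variants(variants, inputs):
    """Run variant i on inputs[i]; a variant that raises contributes no result."""
    results = []
    for variant, y in zip(variants, inputs):
        try:
            results.append(variant(y))
        except Exception:
            pass
    return results


def two_pass_adjudicator(variants, dras, x):
    n = len(variants)
    # Pass 1: behaves like NVP, every variant receives the original input.
    chosen = majority(run_variants(variants, [x] * n), n)
    if chosen is not None:
        return chosen, 1
    # Pass 2: re-express the stored input and rerun the variants.
    chosen = majority(run_variants(variants, [dra(x) for dra in dras]), n)
    if chosen is not None:
        return chosen, 2
    raise TPAFailure("no result selected on either pass")
```

The function returns the selected result together with the pass number, which makes it easy to observe how often the second pass is needed.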
Abbreviations
- V_i: Variant i, i = 1, ..., n
- DM: Decision mechanism
- DRA_i: Data re-expression algorithm i
- n: The number of variants
- TPA: Two-pass adjudicator
- R_ki: Result of V_i for Pass k, i = 1, ..., n; k = 1, 2
- x: Original input
- y_i: Re-expressed input, y_i = DRA_i(x), i = 1, ..., n

TPA structure and operation
[Flowchart: TPA entry; clear the re-expression flag, store and distribute inputs; Variant 1, Variant 2, ..., Variant n execute; gather results; if the data were re-expressed, perform post-execution adjustment of results, if necessary; formal majority voter. If an output is selected, TPA exits. If none is selected (multiple correct or incorrect results), check whether the data were re-expressed: if no, re-express the inputs, set the re-expression flag, and run the variants again; if yes, raise a failure exception.]

Failure-free operation
- Upon entry to the TPA, the executive sets the re-expression flag to 0 (indicating that these inputs are original), stores the original inputs, formats calls to the n variants, and through those calls distributes the inputs
- Each variant, V_i, executes. No failures occur during their execution
- The results of the Pass 1 variant executions (R_1i, i = 1, ..., n) are gathered by the executive and submitted to the tolerance DM
- The R_1i are equal to one another, so the DM selects R_12 (randomly, since the results are equal) as the correct result
- Control returns to the executive
- The executive passes the correct result outside the TPA, and the TPA module is exited

Partial failure scenario: incorrect results on first pass
- The R_1i differ significantly from one another. The DM cannot determine a correct result, and it sets a flag indicating this fact
- Control returns to the executive. The executive checks the re-expression flag to see if the inputs have been re-expressed. They have not
- The executive retrieves the stored input, x, sends it to the DRAs to be re-expressed (via an exact DRA), and sets the re-expression flag to 1 (indicating the input has been re-expressed)
- The DRAs run their re-expression algorithms on x, yielding the re-expressed inputs y_i = DRA_i(x)
- The executive gathers the re-expressed inputs, formats calls to the n variants, and through those calls distributes the re-expressed inputs to the variants
- Each variant, V_i, executes
- The results of the Pass 2 variant executions (R_2i, i = 1, ..., n) are gathered by the executive and submitted to the tolerance DM
- The R_2i are equal to one another, so the DM selects R_23 (randomly, since the results are equal) as the correct result
- Control returns to the executive
- The executive passes the correct result outside the TPA, and the TPA module is exited

Failure scenario: incorrect results on both passes
- The R_1i differ significantly from one another. The DM cannot determine a correct result, and it sets a flag indicating this fact
- Control returns to the executive. The executive checks the re-expression flag to see if the inputs have been re-expressed. They have not
- The executive retrieves the stored input, x, sends it to the DRAs to be re-expressed (via an exact DRA), and sets the re-expression flag to 1 (indicating the input has been re-expressed)
- The DRAs run their re-expression algorithms on x, yielding the re-expressed inputs y_i = DRA_i(x)
- The executive gathers the re-expressed inputs, formats calls to the n variants, and through those calls distributes the re-expressed inputs to the variants
- Each variant, V_i, executes
- The results of the Pass 2 variant executions (R_2i, i = 1, ..., n) are gathered by the executive and submitted to the tolerance DM
- The R_2i differ significantly from one another. The DM cannot determine a correct result, and it sets a flag indicating this fact
- Control returns to the executive. The executive checks the re-expression flag to see if the inputs have been re-expressed. They have been
- The executive raises an exception and the TPA module is exited

TPA and MCR
- TPA was originally developed to handle multiple correct results (MCR)
- Three conditions from which MCR arise
  - Applications correctly resulting in multiple solutions
  - Use of finite-precision arithmetic
  - The existence of the consistent comparison problem (CCP)
- The TPA technique's set of solutions
  - Provides a solution to the MCR problem
  - Yields a higher probability than the NVP majority voter of detecting and selecting correct results, including MCR
  - Is relatively simple and easy to understand and implement

MCR solution category matrix
                                          System type
MCR case               Without history   With history,           With history,
                                         nonconvergent states    convergent states
Application has MCR    I                 II                      III
Finite precision       IV                V                       VI
CCP                    VII               VIII                    IX

Category I MCR case matrix
System        Results close                                Results not close/distinct
Embedded      Use finite-precision detection techniques    Ignore the current frame. No need to distinguish
Stand-alone   Use finite-precision detection techniques    Need to distinguish. Use data diverse technique

TPA example
[Figure: graph of five cities (1 to 5) whose edges are labeled with travel times t = 2.7, 1.3, 2.6, 4.1, 2.0, 3.6, 2.0, and 1.5]

Routes meeting problem requirements
Route   Cities visited   Total time   Comment
A       1-2-3-4-5        7.4          MCR
B       1-2-4-4-3        7.4          MCR
C       1-3-5-4-2        9.7
D       1-3-2-5-4        10.2
E       1-4-5-2-3        10.7
F       1-4-3-2-5        9.6
G       1-5-2-3-4        8.2
H       1-5-4-2-3        8.8

TPA example
- Consider the case in which one of the variants fails and the resulting decision vector is (A, B, C)
- Suppose the TPA category I detection technique is used
- Data re-expression yields an input domain outside the failure region of variant 3
- If data re-expression does not yield a correct result and the result is (A, B, D):
  - If the user is in the loop, the results of each pass are output to the user, and the user can see that it is likely that A and B are correct
  - If the user is not in the loop, then a time-range check on the preferred maximum travel time can be used
- Another means of selecting among MCR is the use of a priori preference or utility information

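The time-range check can be illustrated with the total times from the routes table. The function names and the preferred maximum travel time of 8.0 are hypothetical; the slides do not state a threshold.

```python
# Total travel time per route, taken from the routes table.
ROUTE_TIMES = {
    "A": 7.4, "B": 7.4, "C": 9.7, "D": 10.2,
    "E": 10.7, "F": 9.6, "G": 8.2, "H": 8.8,
}


def mcr_candidates(times, eps=1e-9):
    """Routes that share the minimum total time, i.e. multiple correct results."""
    best = min(times.values())
    return sorted(name for name, t in times.items() if abs(t - best) <= eps)


def time_range_check(decision_vector, times, max_time):
    """Keep only the variant results whose total time is within the preferred maximum."""
    return [route for route in decision_vector if times[route] <= max_time]


# Assumed preferred maximum travel time of 8.0 (not given in the slides).
accepted = time_range_check(("A", "B", "D"), ROUTE_TIMES, max_time=8.0)
```

With these values, `mcr_candidates(ROUTE_TIMES)` yields routes A and B, and the time-range check on the decision vector (A, B, D) keeps A and B while rejecting D, so either remaining route can be output without a user in the loop.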