An Efficient Fault-Tolerant Multi-Bus Data Scheduling Algorithm Based on Replication and Deallocation

BULGARIAN ACADEMY OF SCIENCES
CYBERNETICS AND INFORMATION TECHNOLOGIES, Volume 16, No 2, Sofia, 2016
Print ISSN: 1311-9702; Online ISSN: 1314-4081
DOI: 10.1515/cait-2016-0021

Chafik Arar 1, Mohamed Salah Khireddine 2
1 Computer Science Department, Batna University, Batna, 05000 Algeria
2 Department of Electronics, Batna University, Batna, 05000 Algeria
Emails: chafik.arar@gmail.com khireddine@yahoo.fr

Abstract: The paper proposes a new reliable fault-tolerant scheduling algorithm for real-time embedded systems. The proposed scheduling algorithm considers a single bus fault in multi-bus heterogeneous architectures, caused by hardware faults and compensated by software redundancy solutions. The algorithm is based on both active and passive backup copies, in order to minimize the scheduling length of data on buses. In the experiments, the paper evaluates the proposed methods in terms of data scheduling length for a set of DAG benchmarks. The experimental results show the effectiveness of our technique.

Keywords: Fault-tolerance, scheduling, real time systems, active and passive redundancy, replication, deallocation.

1. Introduction

Nowadays, our society is becoming increasingly dependent on heterogeneous, distributed, embedded and real-time systems, which take over even exceptionally critical decisions. These systems are becoming more complex and more sensitive to faults. Because a malfunction of such a system can have catastrophic consequences, fault-tolerant techniques are required to ensure that it continues to provide a correct service in spite of faults [1-4]. The hardware, as well as the software, of a system can be the target of a variety of faults with different causes; we concentrate on hardware faults and especially communication faults. We can define fault tolerance as a system's ability to continue operating as planned, despite the presence of faults. There are different ways to achieve fault tolerance.
A common one is that some redundancy (such as re-execution and N-version programming) or some kind of recovery action is built into the system. However, fault tolerance increases complexity and may lead to performance degradation if applied naively. Because we target embedded systems, whose resources are limited by space, weight and cost considerations, it is impossible to provide space redundancy. This is why we study only time redundancy solutions.

Several communication fault tolerance approaches for distributed embedded real-time systems have been proposed. These techniques are based on active or passive backup methods. In the active backup scheme, different copies of the message are sent along distinct buses. In [5], the authors develop a fault-tolerant allocation and scheduling method, which maps messages onto a low-cost multiple-bus system to ensure predictable inter-processor communication. In [6], the reliability of the system is increased by providing several paths from source to destination and sending the same packet through each of them (the algorithm is known as multipath routing); the authors use this idea to propose a new mechanism that enables a trade-off between the amount of traffic and the reliability.

On the other hand, in the passive backup scheme only the primary copy of the message is sent; if it fails, another copy (backup) of the message is transmitted. In [7], the authors provide a generic algorithm, based on replication of operations and data communications, which solves the problem of off-line fault-tolerant scheduling of an algorithm onto a multiprocessor architecture. They take into account two kinds of failures: fail-silent and omission. In [8], the authors propose a synthesis-based design methodology, which incorporates formal validation techniques and relieves the designers from the burden of specifying detailed mechanisms for addressing platform faults, while involving them in the definition of the overall fault-tolerance strategy. In [9], the author surveys the problem of how to schedule tasks in such a way that deadlines continue to be met despite processor and communication media (permanent or transient) or software failure.
In [10], the authors propose a new method for identifying bus faults based on a support vector machine. The method operates in two stages: first, the bus fault state is simulated using PSCAD/EMTDC; then a support vector machine model is established for carrying out data pre-treatment. In [2], both active redundancy and a TDMA (Time Division Multiple Access) communication protocol are used to tolerate bus faults. In [11], the authors propose a fine-grained transparent recovery, where the property of transparency can be selectively applied to processes and messages. In [12], the authors propose a QoS-aware dynamic fault-tolerant scheduling algorithm called QAFT that can tolerate one node's permanent failure at one time instant for real-time tasks.

In this paper, we are interested in approaches based on scheduling algorithms, more specifically those based on static scheduling, which takes the dependences and execution costs of tasks and data into account in its scheduling decisions and computes the schedule at compile time. The main objective is to minimise the scheduling length of data on buses, which is the total sending time of data, under the assumption that at most one bus may fail.

The basic idea of our work, the combination of active and passive redundancy in the same scheme, was originally proposed in [4] for processor fault tolerance; what we propose is its adaptation to communication fault tolerance. For that, many transformations and redevelopments are needed.

First, we outline the definition of the scheduling problem as an optimization problem. We use linear programming to formulate the optimization problem of fault-tolerant data scheduling with two types of backup copies, so as to minimise the scheduling length. This formulation provides the best results, but since the problem is NP-hard, it generally takes a long time to obtain an optimal solution, and for some cases we cannot find a feasible solution in an acceptable time.

To overcome this problem, we propose a solution based on a heuristic algorithm called Fault-Tolerant multi-bus data scheduling Algorithm based on Replication and Deallocation (FTA-RD). The aims of this algorithm are twofold: first, to maximize the reliability of the system; second, to minimize the length of the whole generated schedule in both the presence and the absence of faults. Simulation results show that our approach can generally reduce the run-time overhead.

The remainder of this paper is structured as follows. In Section 2, we give a detailed description of our system models and backup copy types. In Section 3, we introduce and discuss our approach with a motivational example, which shows how our approach can minimize the length of the whole generated schedule. Section 4 presents our solution and gives a detailed description of our scheduling algorithm. In Section 5, we present the experiments. We finally conclude this work in Section 6.

2. Problem definition

2.1. System models description

In this section, we first give some definitions that describe our system and then we formally define the problem of fault-tolerant scheduling. The specification of this system involves the description of the task and data models, the architecture model and the fault model.

2.1.1. Task model

The task model is defined by a Directed Acyclic Graph (DAG) noted $G_{task} = (T, E, Exe_{task})$, where: $T = \{t_1, t_2, \dots, t_n\}$ represents a set of $n$ tasks; $E$ is a set of directed edges representing the task dependences, where an edge from a task $t_i$ to a task $t_j$, noted $t_i \to t_j$, means that task $t_j$ depends on the output of task $t_i$; $Exe_{task}(t_i)$ is a function that gives the execution cost of task $t_i \in T$. Fig. 2 represents an example of a task model.

2.1.2. Architecture model

The architecture is modelled by a non-directed graph, noted $G_{arch} = (P, B)$, where each node is a processor and each edge is a communication medium (bus). We assume that the architecture is heterogeneous and fully connected. Fig. 1 shows an example of an architecture model.

2.1.3. Data model

The data model is modelled by another DAG, noted $G_{data} = (M, P, Exe_{data})$. The graph $G_{data}$ is generated from the graph $G_{task}$ with a transformation that respects the data precedence. $M = \{m_1, m_2, \dots, m_l\}$ represents the set of all data transferred between tasks; the cardinality of $M$ is equal to that of $E$. $P$ is a set of directed edges representing the precedence relationships, where an edge from a data $m_i$ to a data $m_j$, noted $m_i \to m_j$, means that data $m_j$ requires $m_i$ to be calculated. $Exe_{data}(m_i)$ is a function that gives the transfer cost of data $m_i \in M$, including the time required to run a failure-detection routine that determines whether the data was received successfully or not.

2.1.4. Failure model

We consider only bus faults. Each bus may fail due to a hardware fault. The faults can be transient or permanent and are independent. In our proposed algorithms, it is assumed that at most one bus will fail to execute a data transfer. We call this the one-bus failure model. A fault-detection mechanism, such as fail-signal and acceptance test, exists to detect the bus failure.

2.2. Backup copies

2.2.1. Replicated backup copy

A replicated backup copy $m_i^{Rep}$ of data $m_i$ is an active backup copy, which is sent independently, no matter whether the primary copy $m_i^{Pr}$ was received successfully or not. If the primary copy fails to reach its destination properly, the replicated copy $m_i^{Rep}$ can be used instead of the primary. For example, in Fig. 5, $m_1^{Rep}$ is a replicated backup copy of $m_1$ scheduled on bus $B_2$.

2.2.2. Deallocated backup copy

A deallocated backup copy of data $m_i$ is a passive backup copy, which is sent only if the primary copy $m_i^{Pr}$ fails. The deallocated copy cannot be scheduled to start until the primary copy has been completely sent and the activation message has been received. An activation message $A_{message}$ is a message sent from a primary copy to its deallocated backup copies, which indicates whether $m_i^{Pr}$ was sent successfully or not. We use $A_{message}$ to denote the time cost of this message.
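The transformation from $G_{task}$ to $G_{data}$ described in Section 2.1.3 is not spelled out in detail. One plausible reading, sketched below in Python purely as an illustration, treats every task edge as one data item and lets $m_a$ precede $m_b$ whenever the task that receives $m_a$ is the task that produces $m_b$. The function name `build_data_model` and the small example graph are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def build_data_model(task_edges):
    """Derive data items and their precedence from task dependences.

    task_edges: list of (src_task, dst_task) pairs, one per edge of G_task.
    Returns (data, precedence): data maps a name m1, m2, ... to its task
    edge; precedence is a set of (m_a, m_b) pairs meaning that m_a must be
    received before m_b can be sent.
    Assumption: m_a precedes m_b when m_a's receiver is m_b's sender.
    """
    data = {f"m{k}": edge for k, edge in enumerate(task_edges, start=1)}
    produced_by = defaultdict(list)            # task -> data items it sends
    for name, (src, _dst) in data.items():
        produced_by[src].append(name)
    precedence = set()
    for name, (_src, dst) in data.items():
        for successor in produced_by[dst]:
            precedence.add((name, successor))
    return data, precedence

# Hypothetical four-task graph: t1 -> t2, t2 -> t3, t2 -> t4
data, prec = build_data_model([("t1", "t2"), ("t2", "t3"), ("t2", "t4")])
print(sorted(prec))
```

With this construction the number of data items equals the number of task edges, which matches the statement that the cardinality of $M$ equals that of $E$.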

Scheduling several deallocated backup copies at the same time on the same bus is called backup overlapping. Backup overlapping is implemented under the assumption that at most one bus failure has to be tolerated, so that no more than one backup will run at any time. For example, in Fig. 5, the deallocated backup copy of $m_2$ is scheduled on bus $B_2$ and overlaps with another deallocated backup copy, that of $m_6$.

3. Illustrative example

In this section, we provide an example to illustrate the problem that we try to solve. The architecture model of our system is composed of three processors fully connected by three buses (as shown in Fig. 1); we assume that there will be at most one bus fault.

Fig. 1. Architecture model

Fig. 2. Task model

Fig. 2 presents the task model of our example, with nine tasks $t_1, t_2, t_3, t_4, t_5, t_6, t_7, t_8$ and $t_9$. The edges show the task dependences. In the graph, for instance, task $t_4$ cannot start executing until task $t_1$ has been executed successfully.

Fig. 3 presents the data model generated from the task model of Fig. 2. The data model preserves the task dependences. In the graph, for instance, data $m_3$ and $m_4$ cannot be sent until data $m_1$ has been successfully received, because in the task model of Fig. 2, task $t_4$ needs $m_1$ to calculate $m_3$ and $m_4$. For this example, we assume that the time delay for the activation message $A_{message}$ is 1 time unit.

Fig. 3. Data model

The problem to solve is finding an optimal schedule with a minimized length for these data and their backup copies, so that all data can be transmitted successfully, assuming that only one bus may fail. Fig. 4 presents an example of a non-fault-tolerant schedule; its length is equal to 8.

Fig. 4. Non-fault-tolerant data scheduling

For our example, the advantage of combining both replication and deallocation in the same algorithm is shown by three kinds of optimal schedules that tolerate one bus fault (Figs 5, 6 and 7). All the optimal results can be obtained by the linear programming formulation presented in Section 4.

Fig. 5. Optimal fault-tolerant schedule with both replicated and deallocated backup copies

An optimal fault-tolerant schedule with both replicated and deallocated backup copies for our example is shown in Fig. 5. The minimal scheduling length is 9 time units. In this schedule we choose deallocated backup copies for data $m_2$ and $m_6$ and replicated backup copies for the rest of the data. The deallocated backup copy of $m_2$ overlaps with that of $m_6$ at Steps 7 and 8 on bus $B_2$ to reduce the scheduling length.

Fig. 6. Optimal fault-tolerant schedule with only replicated backup copies

An optimal fault-tolerant schedule with only replicated backup copies, with a length equal to 10 time units, is shown in Fig. 6.

An optimal fault-tolerant schedule with only deallocated backup copies is shown in Fig. 7; the scheduling length is 17 time units. In this schedule, deallocated backup copies overlap at Steps 4 and 5 and again at Steps 10 and 11. There is at least 1 time unit of delay, for $A_{message}$, between a data item and its deallocated backup copy. If the primary copy is sent successfully, the activation message cancels its deallocated backup copy.
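The three fault-tolerant schedule lengths above (9, 10 and 17 time units), together with the length 8 of the non-fault-tolerant schedule of Fig. 4, can be compared with a few lines of Python. This is only a convenience sketch for checking the arithmetic; the helper name `relative_change` is hypothetical.

```python
def relative_change(value, reference):
    """Signed change of value relative to reference, in percent."""
    return 100.0 * (value - reference) / reference

lengths = {
    "no fault tolerance": 8,
    "replication + deallocation": 9,
    "replication only": 10,
    "deallocation only": 17,
}
hybrid = lengths["replication + deallocation"]
for name, length in lengths.items():
    if name != "replication + deallocation":
        # Negative values mean the hybrid schedule is shorter.
        print(f"hybrid vs {name}: {relative_change(hybrid, length):+.1f}%")
```

The hybrid schedule is 12.5% longer than the schedule without fault tolerance, 10% shorter than replication only, and roughly 47% shorter than deallocation only.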

Fig. 7. Optimal fault-tolerant schedule with only deallocated backup copies

In this example, the optimal schedule with both deallocation and replication reduces the scheduling length by 47.05% compared with deallocation only, while it loses only 12.5% compared with a non-fault-tolerant solution. It also reduces the scheduling length by 10% compared with the optimal schedule with replication only. The performance is improved significantly.

4. The proposed approach

In this section, we define the scheduling problem as an optimization problem and we use linear programming to find the optimal schedule, with minimal length, of the data dependences and their backup copies so as to tolerate one bus fault. As our solution is based on a hybrid approach that combines both passive and active redundancy, we use two types of backup copies: replicated and deallocated.

We model the scheduling of data dependences with binary variables that determine the order of data. $M^{Pr}_{i,j,t}$ is a binary variable such that $M^{Pr}_{i,j,t} = 1$ if and only if the primary copy of data $m_i$ was successfully sent on bus $B_j$ at step $t$. Similarly, the binary variables $M^{Rep}_{i,j,t}$ and $M^{De}_{i,j,t}$ are used for the replicated and deallocated backup copies:

$$M^{Pr}_{i,j,t},\ M^{Rep}_{i,j,t},\ M^{De}_{i,j,t} \in \{0, 1\}, \quad i \in \{1,\dots,n\},\ j \in \{1,\dots,n_{bus}\},\ t \in \{1,\dots,L_{sch}\} \qquad (1)$$

Here $i$ ranges over the data, $j$ over the buses and $t$ over the time steps; $L_{sch}$ is an upper bound of the scheduling length.

The objective function of our linear problem is the minimization of the scheduling length, which can be mathematically formulated as

$$\text{Minimize} \quad \sum_{t=1}^{L_{sch}} \sum_{j=1}^{n_{bus}} t \cdot M^{Pr}_{out,j,t} \qquad (2)$$

where $out$ is a fictitious node added to the data model (DAG) in order to compute the total scheduling length (Fig. 8).

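Fig. 8. The fictitious node added to the DAG

The minimization of the total scheduling length is subject to the following constraints.

Data mapping constraint. The primary copy of each data is scheduled once and only once:

$$\forall i \in \{1,\dots,n\}: \quad \sum_{t=1}^{L_{sch}} \sum_{j=1}^{n_{bus}} M^{Pr}_{i,j,t} = 1 \qquad (3)$$

The backup copy of each data, either replicated ($M^{Rep}$) or deallocated ($M^{De}$), is scheduled once and only once:

$$\forall i \in \{1,\dots,n\}: \quad \sum_{t=1}^{L_{sch}} \sum_{j=1}^{n_{bus}} \left( M^{Rep}_{i,j,t} + M^{De}_{i,j,t} \right) = 1 \qquad (4)$$

At any time, a bus $B_j$ is used for the transmission of at most one primary copy of data or replicated backup copy:

$$\forall t \in \{1,\dots,L_{sch}\},\ \forall j \in \{1,\dots,n_{bus}\}: \quad \sum_{i=1}^{n} \left( M^{Pr}_{i,j,t} + M^{Rep}_{i,j,t} \right) \le 1 \qquad (5)$$

In a multi-bus system, to tolerate one bus fault, we can use only one backup copy for each data dependency. This backup copy can be either replicated or deallocated.

Dependency constraint. Backup copies must meet the same precedence relationships as their primary copies. $m_i \to m_k \in P$ means that data $m_k$ requires data $m_i$ to be calculated; data $m_k$ cannot be sent until data $m_i$ has been received and used to calculate $m_k$. For two replicated copies, for instance:

$$\forall i, k \in \{1,\dots,n\},\ m_i \to m_k \in P: \quad \sum_{t=1}^{L_{sch}} \sum_{j=1}^{n_{bus}} t \cdot M^{Rep}_{i,j,t} + Exe(m_i) \le \sum_{t=1}^{L_{sch}} \sum_{j=1}^{n_{bus}} t \cdot M^{Rep}_{k,j,t} \qquad (6)$$

For each $m_i \to m_k \in P$, a new directed acyclic graph that represents all dependences between the possible backup copies and their primary copies is generated. Fig. 9 presents the part of the DAG for $m_i \to m_k$.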

Fig. 9. The new DAG for $m_i \to m_k$

Fig. 10 presents the complete DAG for the data model in Fig. 3.

Fig. 10. The complete DAG for the data model of our example

For our example, however, we have chosen deallocated backup copies for $m_2$ and $m_6$ and replicated backup copies for $m_1$, $m_3$, $m_4$, $m_5$ and $m_7$, so Fig. 11 presents the reduced DAG that respects this choice.

Fig. 11. The reduced DAG

Fault-tolerance constraint. The primary copy and its backup copy must never be assigned to the same bus:

$$\forall i \in \{1,\dots,n\},\ \forall j \in \{1,\dots,n_{bus}\}: \quad \sum_{t=1}^{L_{sch}} \left( M^{Pr}_{i,j,t} + M^{Rep}_{i,j,t} + M^{De}_{i,j,t} \right) \le 1 \qquad (7)$$
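Constraints (3), (4) and (7) are easy to check on a concrete schedule. The Python sketch below is an illustration only: it represents a schedule as a list of hypothetical `(data, kind, bus, start)` tuples, with `kind` one of `"Pr"`, `"Rep"` or `"De"` (labels chosen here for primary, replicated and deallocated copies), and verifies that each data item has exactly one primary copy, exactly one backup copy, and that the two sit on different buses.

```python
from collections import defaultdict

def check_mapping(schedule, data_names):
    """Return a list of violated rules for a candidate schedule.

    schedule: iterable of (data, kind, bus, start) tuples, where kind is
    "Pr" (primary), "Rep" (replicated) or "De" (deallocated).
    """
    copies = defaultdict(list)
    for data, kind, bus, _start in schedule:
        copies[data].append((kind, bus))
    errors = []
    for name in data_names:
        primaries = [bus for kind, bus in copies[name] if kind == "Pr"]
        backups = [bus for kind, bus in copies[name] if kind in ("Rep", "De")]
        if len(primaries) != 1:                       # constraint (3)
            errors.append(f"{name}: needs exactly one primary copy")
        if len(backups) != 1:                         # constraint (4)
            errors.append(f"{name}: needs exactly one backup copy")
        if set(primaries) & set(backups):             # constraint (7)
            errors.append(f"{name}: primary and backup share a bus")
    return errors

# Hypothetical one-item schedules: a valid one and an invalid one.
good = [("m1", "Pr", "B1", 0), ("m1", "Rep", "B2", 0)]
bad = [("m1", "Pr", "B1", 0), ("m1", "De", "B1", 3)]
print(check_mapping(good, ["m1"]))
print(check_mapping(bad, ["m1"]))
```

A checker of this kind does not replace the linear program; it only makes the meaning of the three constraints concrete on small instances.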

Execution constraint. For the task dependences $t_x \to t_y \to t_z$ (as shown in Fig. 12), the data $m_b$ cannot be scheduled until data $m_a$ has been successfully received by task $t_y$ and the latter has been completely executed:

$$\sum_{t=1}^{L_{sch}} \sum_{j=1}^{n_{bus}} t \cdot M^{Pr}_{a,j,t} + Exe_{task}(t_y) \le \sum_{t'=1}^{L_{sch}} \sum_{j=1}^{n_{bus}} t' \cdot M^{Pr}_{b,j,t'} \qquad (8)$$

Fig. 12. Execution constraint

The same can be said for all other combinations of copies; in our case we can count nine possibilities: (Pr, Pr), (Pr, Rep), (Pr, De), (Rep, Pr), (Rep, Rep), (Rep, De), (De, Pr), (De, Rep), (De, De).

Overlapping and resource constraint. The overlapping of two or more deallocated backup copies must respect the following conditions: their primary copies must be assigned to different buses and must be completely independent, i.e., they have no dependency. Only deallocated backup copies may overlap. If two deallocated backup copies $M^{De}_{i,j,t}$ and $M^{De}_{k,j,t}$ overlap on the same bus $B_j$, then their primary copies are assigned to two different buses other than $B_j$:

$$\forall i, k \in \{1,\dots,n\}: \quad M^{De}_{i,j,t} \cdot M^{De}_{k,j,t} = 1 \ \Rightarrow\ \sum_{t'=1}^{L_{sch}} \left( M^{Pr}_{i,j,t'} + M^{Pr}_{k,j,t'} \right) = 0 \qquad (9)$$

$$\forall i, k \in \{1,\dots,n\},\ \forall j \in \{1,\dots,n_{bus}\}: \quad \sum_{t=1}^{L_{sch}} M^{Pr}_{i,j,t} \cdot M^{Pr}_{k,j,t} = 0 \qquad (10)$$

The solution obtained in this way is the best result, but since the problem is NP-hard, this formulation generally takes a long time to obtain an optimal solution, and for some cases we cannot find a feasible solution in an acceptable time. That is why we propose our second solution, based on a heuristic algorithm called FTA-RD. The aims of this algorithm are twofold: first, to maximize the reliability of the system; second, to minimize the length of the whole generated schedule in both the presence and the absence of faults.

Our scheduling algorithm is a greedy list scheduling heuristic, which schedules one operation at each step. The input to our algorithm is an instance of the task graph $G_{task} = (T, E, Exe_{task})$, the architecture graph $G_{arch} = (P, B)$, and the time cost of the activation message $A_{message}$; it generates a distributed static schedule of a given task model onto a given architecture model, which minimizes the system's run-time and tolerates one bus fault. The time complexity of the FTA-RD algorithm is polynomial.
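The overlapping conditions stated earlier in this section (independent primary copies, assigned to two different buses, neither of which is the bus hosting the overlapping backups) can be expressed as a small predicate. The Python sketch below is illustrative; the function name `can_overlap`, its arguments and the bus assignments in the example are hypothetical and are not read off the paper's figures.

```python
def can_overlap(a, b, primary_bus, backup_bus, depends):
    """Decide whether the deallocated backups of a and b may share backup_bus.

    primary_bus: dict mapping each data item to the bus of its primary copy.
    depends: set of (x, y) pairs meaning x must precede y; it is assumed to
    be transitively closed so that indirect dependences are also listed.
    """
    independent = (a, b) not in depends and (b, a) not in depends
    different_primaries = primary_bus[a] != primary_bus[b]
    off_backup_bus = backup_bus not in (primary_bus[a], primary_bus[b])
    return independent and different_primaries and off_backup_bus

# Hypothetical placement: primaries on B1 and B3, backups proposed on B2.
primary_bus = {"m2": "B1", "m6": "B3"}
print(can_overlap("m2", "m6", primary_bus, "B2", depends=set()))
print(can_overlap("m2", "m6", primary_bus, "B2", depends={("m2", "m6")}))
print(can_overlap("m2", "m6", primary_bus, "B1", depends=set()))
```

Under the one-bus failure model, these conditions ensure that at most one of the two overlapping backups is ever activated, since a single bus fault can hit only one of the two primaries.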

The FTA-RD scheduling algorithm is shown in Fig. 13.

Fig. 13. The FTA-RD scheduling algorithm
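As a rough illustration of the greedy list-scheduling idea (not a transcription of Fig. 13), the Python sketch below places each data item's primary copy on the bus that becomes available earliest, breaking ties by the least used bus, and then places one backup copy on a different bus. Everything here is an assumption made for the sketch: the function name `fta_rd_sketch`, integer transfer costs, a backup type supplied per data item, task execution times ignored, and successors conservatively waiting for the later of the two copies. Deallocated backups are allowed to overlap when their primaries are independent and on different buses, but the sketch does not actively search for overlaps and does not redefine the backup type of the last data item.

```python
def fta_rd_sketch(data, buses, backup_type, a_message=1):
    """data: list of (name, cost, preds) in topological order, integer costs."""
    slots = {b: [] for b in buses}   # bus -> list of (start, end, name, kind)
    finish, primary_bus, ancestors = {}, {}, {}

    def conflicts(bus, start, end, name, kind):
        for s, e, other, other_kind in slots[bus]:
            if start < e and s < end:                # time intervals intersect
                may_overlap = (kind == other_kind == "De"
                               and other not in ancestors[name]
                               and primary_bus[other] != primary_bus[name])
                if not may_overlap:
                    return True
        return False

    def place(name, cost, kind, ready, allowed):
        best = None
        for bus in allowed:
            start = ready
            while conflicts(bus, start, start + cost, name, kind):
                start += 1                           # try the next time step
            load = sum(e - s for s, e, _, _ in slots[bus])
            if best is None or (start, load) < best[:2]:
                best = (start, load, bus)            # earliest, then least used
        start, _, bus = best
        slots[bus].append((start, start + cost, name, kind))
        return bus, start + cost

    for name, cost, preds in data:
        ancestors[name] = set(preds).union(*(ancestors[p] for p in preds))
        ready = max((finish[p] for p in preds), default=0)
        bus, end_pr = place(name, cost, "Pr", ready, buses)
        primary_bus[name] = bus
        others = [b for b in buses if b != bus]      # backup never shares the bus
        kind = backup_type[name]
        # A deallocated copy waits for the primary and the activation message.
        backup_ready = ready if kind == "Rep" else end_pr + a_message
        _, end_bk = place(name, cost, kind, backup_ready, others)
        finish[name] = max(end_pr, end_bk)
    return slots, max(finish.values())

# Hypothetical instance: m1 feeds m2 and m3, three buses, replication only.
example = [("m1", 2, []), ("m2", 2, ["m1"]), ("m3", 2, ["m1"])]
slots, length = fta_rd_sketch(example, ["B1", "B2", "B3"],
                              {"m1": "Rep", "m2": "Rep", "m3": "Rep"})
print(length)
```

The tie-breaking rule on `(start, load)` is what provides the load balancing mentioned for the primary copies; other priority rules would fit the same skeleton.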

FTA-RD operates as follows:

1. First, it examines the data dependences one by one to determine when and on which bus each will be sent. It finds the bus with the earliest available time that is also the least used (the use of a bus is measured by the total transmission time of all data assigned to that bus); this assignment therefore ensures load balancing.

2. Once the primary copy is scheduled, the algorithm attempts to schedule the backup copy. It first tries to overlap it with an existing deallocated backup copy; if this is not possible, it allocates a new replicated or a new deallocated backup copy according to the BACKUP type.

3. Finally, to reduce the scheduling length, the backup type for the last data is redefined.

5. Simulations, results and discussion

In this section, we present the results of simulations. We compare the scheduling length of the proposed scheduling algorithm FTA-RD with that of the solution based on the linear programming formulation and of solutions based only on replication or only on deallocation.

We have applied the FTA-RD heuristic to an example of an architecture graph composed of five processors and three buses. The failure rates of the processors are respectively $10^{-5}$, $10^{-5}$, $10^{-4}$, $10^{-5}$ and $10^{-6}$, and the failure rates of the buses SAM MP1, SAM MP2 and SAM MP3 are respectively $10^{-6}$, $10^{-5}$ and $10^{-4}$. The algorithm graphs used are those of DSP benchmarks from DSPstone [13]; the number of tasks and data dependences in each benchmark is listed in Table 1.

Table 1. Benchmarks information

Benchmark             Tasks   Data-dependences
FIR filter            1       8
-otv                  1       9
iir filter            17      13
-deq filter           4       20
IIR biquad section    36      31
-rls-lat              8       33
8latr                 46      38

We use the Simulation Tool for Real time Multiprocessor scheduling (STORM) to calculate the data scheduling length. From the specification of the characteristics of the software architecture (the tasks to schedule), the hardware architecture (the resources for implementing these tasks) and the choice of a scheduling policy, the tool simulates the execution of these tasks on these resources according to the rules of this policy [14]. Fig. 14 shows the task model and the allocation of buses in the STORM editor.

Fig. 14. Task model and bus allocation

Results show that the FTA-RD heuristic, with both replicated and deallocated backup copies, performs better than the heuristics that use only replicated or only deallocated backup copies, in all benchmarks. The FTA-RD heuristic reduces the scheduling length by 6.19% on average when compared to the only-replication heuristic (Fig. 15), and by 19.9% on average when compared to the only-deallocation heuristic (Fig. 16). Our heuristic loses 8.41% on average in scheduling length compared to the optimal solution obtained by the linear programming formulation (Fig. 17).

Fig. 15. Scheduling length of FTA-RD heuristic and only-replication heuristic

Fig. 16. Scheduling length of FTA-RD heuristic and only-deallocation heuristic

We can see from the results that:

When the number of data dependences is small, the algorithm that uses only replicated backup copies is more efficient than the one that uses deallocated backup copies. This is explained by the fact that when the number of data is low, the opportunity for backup overlapping is also low.

When the number of data dependences is larger, more deallocated backup copies can be overlapped, which allows a reduction of the scheduling length.

Fig. 17. Scheduling length of FTA-RD heuristic and LP-formulation solution

Also, the delay due to the activation message may have a negative effect on the schedule length.

The use of both replicated and deallocated backup copies gives better results than the use of only one type of backup copy.

The results obtained by the linear programming formulation are better than those obtained by the FTA-RD heuristic; the problem is that in many cases the calculation time of this formulation is too long to produce results. The advantage of the FTA-RD heuristic is that it has a polynomial running time.

6. Conclusion

In this paper, we have studied the problem of fault tolerance in embedded real-time systems and proposed a software-implemented fault-tolerance solution for multi-bus architectures based on software redundancy. We have proposed a new scheduling heuristic, called FTA-RD, which automatically produces a static distributed fault-tolerant schedule when data dependences are represented by directed acyclic graphs, with two types of backup copies, under the assumption that there will be at most one bus fault. The simulations show a significant improvement compared to algorithms with only one type of backup copy.

Finally, we plan to carry out an experiment involving our method on an electric autonomous vehicle with a 5-processor multi-bus architecture.

References

1. Jalote, P. Fault Tolerance in Distributed Systems. Prentice-Hall, Inc., 1994.
2. Kopetz, H. Real-Time Systems: Design Principles for Distributed Embedded Applications. Springer Science & Business Media, 2011.
3. Grünsteidl, G., H. Kantz, H. Kopetz. Communication Reliability in Distributed Real-Time Systems. In: Distributed Computer Control Systems 1991: Towards Distributed Real-Time Systems with Predictable Timing Properties, 2014, p. 13.
4. Jun, Z., E. Sha et al. Efficient Fault-Tolerant Scheduling on Multiprocessor Systems via Replication and Deallocation. International Journal of Embedded Systems, Vol. 6, 2014, No 2-3, pp. 216-224.
5. Kandasamy, N., J. P. Hayes, B. T. Murray. Dependable Communication Synthesis for Distributed Embedded Systems. Lecture Notes in Computer Science, 2003, pp. 275-288.
6. Dulman, S., T. Nieberg, J. Wu et al. Trade-Off between Traffic Overhead and Reliability in Multipath Routing for Wireless Sensor Networks. IEEE, 2003.
7. Dima, C., A. Girault, C. Lavarenne et al. Off-Line Real-Time Fault-Tolerant Scheduling. In: Proc. of 9th Euromicro Workshop on Parallel and Distributed Processing, 2001. IEEE, 2001, pp. 410-417.
8. Pinello, C., P. C. Luca, L. S.-V. Alberto. Fault-Tolerant Deployment of Embedded Software for Cost-Sensitive Real-Time Feedback-Control Applications. In: Proc. of Conference on Design, Automation and Test in Europe-Volume 2. IEEE Computer Society, 2004, p. 21164.
9. Krishna, C. M. Fault-Tolerant Scheduling in Homogeneous Real-Time Systems. ACM Computing Surveys (CSUR), 2014, Vol. 46, No 4, p. 48.
10. Song, H., H. Wu. The Applied Research of Support Vector Machine in Bus Fault Identification. In: Proc. of 6th International Conference on Natural Computation (ICNC), 2010. IEEE, 2010, pp. 136-139.
11. Izosimov, V., P. Pop, P. Eles et al. Scheduling and Optimization of Fault-Tolerant Embedded Systems with Transparency/Performance Trade-Offs. ACM Transactions on Embedded Computing Systems (TECS), Vol. 11, 2012, No 3, p. 61.
12. Zhu, X., X. Qin, M. Qiu. QoS-Aware Fault-Tolerant Scheduling for Real-Time Tasks on Heterogeneous Clusters. IEEE Transactions on Computers, Vol. 60, 2011, No 6, pp. 800-812.
13. Zivojnovic, V., J. M. Velarde, C. Schlager et al. DSPstone: A DSP-Oriented Benchmarking Methodology. In: Proc. of International Conference on Signal Processing Applications and Technology, 1994, pp. 715-720.
14. RTS Group, STORM. IRCCyN Laboratory in Nantes, France. http://www.rts-software.org/