Fault tolerant scheduling in real time systems

Afrin Shafiuddin
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
shafiuddin@wisc.edu

Swetha Srinivasan
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
srinivasan9@wisc.edu

Abstract

Fault tolerance in uniprocessor systems is usually handled by adding time redundancy to the schedule so that any task instance can be re-executed if a fault occurs during its execution. This paper compares fault-tolerant versions of the Earliest Deadline First (EDF) scheduling policy with a proposed scheduling algorithm, based on Round Robin scheduling and shortest-remaining-time-first scheduling, for periodic real-time tasks. The scheme can be used to tolerate transient faults during the execution of tasks. A tool was developed to evaluate the performance of this idea.

Keywords: Real-Time Systems, Periodic Task Scheduling, Uniprocessor, Fault-Tolerant Scheduling, Earliest Deadline, Time Redundancy

I. INTRODUCTION

Real-time systems are systems in which the computer is committed to responding to external stimuli in a timely manner. The correctness of the system depends not just on the logical result but also on the time at which the result arrives. Real-time systems often have deadlines to meet during their operation. A missed deadline is catastrophic in a hard real-time system, and in a soft real-time system it can lead to a significant loss. Predictability of system behavior is the most important concern in these systems, and it is usually achieved by static or dynamic scheduling of real-time tasks so that they meet their deadlines.

When a hard real-time system is used for critical applications such as air traffic control or control systems, fault tolerance becomes an integral part of its design. Different types of faults can occur in a real-time system: transient faults, permanent faults, and intermittent faults.
Permanent faults may be caused by irreparable damage to the hardware, whereas transient faults can result from temporary environmental conditions. Intermittent faults are present only occasionally, due to unstable hardware or varying hardware or software states (e.g., as a function of load or activity). Transient and intermittent faults are the major source of system errors in real-time systems. It has been shown that transient faults are 30 times more frequent than permanent faults. Transient faults are generally tolerated using time redundancy, which means re-executing any task that was running when the transient fault occurred.

In this paper, we are mainly concerned with providing fault-tolerant real-time scheduling to handle transient faults. We achieve this by adding enough time redundancy to the schedule, sized efficiently, so that any faulty task can be re-executed while the schedule still achieves high utilization.

II. TASKS, SYSTEM AND FAULT MODEL

A. Task

The basic component of scheduling is a task. A task is a unit of work, such as a program or code block, that provides some service of an application when executed. Examples of tasks are reading sensor data, processing a unit of data, and transmitting it. A periodic task system is a set of tasks in which each task is characterized by a period, a deadline, and a worst-case execution time (WCET). Each task in a periodic task system has an inter-arrival time, called the period of the task. In each period, a job of the task is released. A job is ready to execute at the beginning of each period, called the release time of the job. Each job of a task has a relative deadline, which is the time by which the job must finish its execution relative to its release time. The relative deadlines of all jobs of a particular periodic task are the same. The absolute deadline of a job is the time instant equal to its release time plus its relative deadline.
Each periodic task has a WCET, which is the maximum execution time that each job of the task requires between its release time and its absolute deadline. If the relative deadline of each task in a task set is less than or equal to its period, the task set is called a constrained-deadline periodic task system. If the relative deadline of each task in a constrained-deadline task set is exactly equal to its period, the task set is called an implicit-deadline periodic task system. If a periodic task system is neither constrained nor implicit, it is called an arbitrary-deadline periodic task system. In this paper, scheduling of an implicit-deadline periodic task system is considered.

The scheduling of n implicit-deadline periodic tasks in the set Γ = {τ_1, τ_2, ..., τ_n} is considered. Each task τ_i in Γ is characterized by a pair (C_i, T_i), where C_i is the WCET and T_i is the period of task τ_i. Each task τ_i is released and ready for execution at the beginning of each period T_i and requires at most C_i units of execution time before the next period. The relative deadline of a task τ_i is equal to its period T_i, that is, Γ is an implicit-deadline task system.
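As a small illustration of the task model above, the following Python sketch represents an implicit-deadline periodic task as a pair (C_i, T_i) and lists the release time and absolute deadline of each of its jobs. The names `PeriodicTask` and `jobs_until` are hypothetical and chosen only for this example; the first release is assumed to be at time 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodicTask:
    wcet: int    # C_i: worst-case execution time
    period: int  # T_i: period, also the relative deadline (implicit deadline)


def jobs_until(task: PeriodicTask, horizon: int) -> list[tuple[int, int]]:
    """Return (release time, absolute deadline) for every job released before `horizon`.

    Assumption for this sketch: the first job is released at time 0.
    """
    return [(release, release + task.period)
            for release in range(0, horizon, task.period)]


task = PeriodicTask(wcet=1, period=5)
print(jobs_until(task, 15))  # [(0, 5), (5, 10), (10, 15)]
```

Each job's absolute deadline coincides with the release of the next job, which is exactly the implicit-deadline property.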

B. Utilization

The load, or utilization, is the fraction of processor time spent executing a task. The utilization of a task τ_i is denoted by u_i = C_i / T_i. The total load, or total utilization, of any task set A is

    U(A) = \sum_{\tau_i \in A} u_i

For example, the total utilization of the task set Γ is U(Γ), which is the total load of the task set.

C. System Model

The system model considered in this paper is a uniprocessor system, in which tasks execute one at a time. Preemption is allowed so that one task can override another: a scheduling algorithm is preemptive if the release of a new job of a higher-priority task can preempt the currently running job of a lower-priority task. At run time, task scheduling essentially consists of determining the highest-priority active task and executing it on the processor. Tasks are assumed to be independent, that is, they share no resource other than the processor. The cost of preemption and context switching is assumed to be negligible.

D. Fault Model

A fault-tolerant scheduling algorithm must guarantee that all task deadlines are met when faults occur, even under the worst-case load. No fault-tolerant system can, however, tolerate an arbitrary number of faults within a particular time interval. The scheduling guarantee in a fault-tolerant system is therefore given under the assumption of a certain fault model.

In this project, the fault model covers faults whose resulting errors are transient. Transient faults are assumed to be short lived and not to reappear when the same task is re-executed. If the effect of a software fault is manifested as a transient error that does not reappear upon re-execution, the fault can be tolerated by simply re-executing the task. For example, due to changes in the environment or in the input parameters, the execution path taken by a piece of software can differ from one execution to another. In such a case, the same error is not expected to occur when the software is re-executed, since a different execution path is taken.

Time redundancy is used in this paper to tolerate transient faults. Faults are assumed to be detected at the end of the execution of a task. When a fault occurs during the execution of a task and the error is detected, the faulty task is re-executed. The first execution of a task is called the primary copy of the task; the re-execution that is activated after an error is detected is called the recovery copy. It is also assumed that the error-detection and fault-tolerance mechanisms are themselves fault tolerant. At most one fault is assumed to occur in each hyperperiod, so only one task can be faulty at a time. In summary, the fault model considered in this paper is reasonably representative and general enough to cover a variety of transient faults in hardware and software.

III. EARLIEST DEADLINE FIRST

A. Description

In recent years, the Earliest-Deadline-First (EDF) scheduling policy has been used to schedule real-time tasks in a variety of critical applications. EDF is a dynamic, priority-driven scheduling algorithm that assigns task priorities based on deadlines. The tasks are assumed to be independent, periodic tasks with implicit deadlines. The task with the earliest deadline at run time is assigned the highest priority. That is, if a task is executing and a new task arrives with an earlier deadline, the new task gets the higher priority and preempts the running task. EDF is an optimal dynamic priority-driven preemptive scheduling algorithm for a real-time system on a uniprocessor. EDF is guaranteed to schedule periodic tasks with implicit deadlines up to 100% utilization.

If each task τ_i is represented as (C_i, T_i), where C_i is the worst-case computation time and T_i is the period (and deadline) of the task, then the schedulability condition for EDF in terms of the utilization U is

    U = \sum_{i=1}^{n} C_i / T_i \le 1

This is the Liu and Layland [1] schedulability bound. However, EDF scheduling by itself does not guarantee fault tolerance, because it adds no time redundancy to the schedule. When a task suffers a transient fault, time redundancy gives the faulty task enough time to re-execute and complete before its deadline. It is therefore necessary to add appropriate and efficient time redundancy to the EDF policy for scheduling periodic, preemptive tasks. Under EDF with utilization below 100%, there is a natural amount of slack on the uniprocessor, but this natural slack is not necessarily enough for re-executing faulty tasks. An efficient fault-tolerant mechanism requires that additional slack be added to the schedule. The recovery mechanism ensures that the reserved slack can be used to re-execute a task before its deadline without causing other tasks to miss their deadlines.

B. Fault Tolerance in EDF

In general, fault tolerance in uniprocessors is achieved by re-executing the faulty task. Tasks are executed according to the EDF scheme, and priorities are determined dynamically: the task with the earliest absolute deadline has the highest priority. When a fault occurs, the system goes into fault recovery, where the task that already executed is added back to the schedule as if it had just arrived and the priorities are recalculated. The tasks are again scheduled dynamically, so the task with the highest priority executes and the other tasks are delayed. If the utilization of the task set is close to 100%, there is not enough slack to re-execute the faulty task. Therefore, slack needs to be added to the schedule. This slack, or time redundancy, ensures that the task has enough time to re-execute without missing its deadline. Adding time redundancy, however, also reduces the achievable utilization.
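The recovery behaviour described above can be sketched as a small unit-step simulation. This is a minimal illustration under stated assumptions, not the authors' simulator: time advances in integer slots, a fault in a slot marks the job running in that slot, the error is detected when that job finishes, and the job then restarts from scratch with its original deadline. The function name `edf_with_reexecution`, the `Job` fields, and the example task set (1, 5), (5, 7) are chosen for this example only.

```python
from dataclasses import dataclass


@dataclass
class Job:
    task: int             # index of the task this job belongs to
    deadline: int         # absolute deadline
    wcet: int             # worst-case execution time C_i
    done: int = 0         # units executed in the current attempt
    faulty: bool = False  # True if a fault hit the current attempt


def edf_with_reexecution(tasks, horizon, fault_times=()):
    """Simulate preemptive EDF in unit time slots with re-execution on faults.

    tasks: list of (C_i, T_i) pairs; fault_times: slots in which the running job is hit.
    Returns (trace, missed): the task index run in each slot (None when idle) and
    a list of (task index, deadline) for jobs that were unfinished at their deadline.
    """
    ready, trace, missed = [], [], []
    for t in range(horizon):
        # Release one job per task at the start of each of its periods.
        for i, (wcet, period) in enumerate(tasks):
            if t % period == 0:
                ready.append(Job(i, t + period, wcet))
        # Jobs still unfinished at their deadline are recorded as misses and dropped.
        for job in [j for j in ready if j.deadline <= t]:
            missed.append((job.task, job.deadline))
            ready.remove(job)
        if not ready:
            trace.append(None)
            continue
        # EDF: run the ready job with the earliest absolute deadline.
        job = min(ready, key=lambda j: (j.deadline, j.task))
        job.done += 1
        job.faulty = job.faulty or t in fault_times
        trace.append(job.task)
        if job.done == job.wcet:
            if job.faulty:
                # Error detected at the end of execution: start the recovery copy.
                job.done, job.faulty = 0, False
            else:
                ready.remove(job)
    return trace, missed


tasks = [(1, 5), (5, 7)]
print(edf_with_reexecution(tasks, 14)[1])                    # []
print(edf_with_reexecution(tasks, 14, fault_times={12})[1])  # []
print(edf_with_reexecution(tasks, 14, fault_times={5})[1])   # [(1, 7)]
```

With no slack reserved, a fault that hits the short task is absorbed by the natural slack, while a fault that hits the long task late in its execution causes a deadline miss. This is the situation that added time redundancy is meant to prevent.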

C. Time Redundancy in EDF

The amount of time redundancy has a considerable impact on the efficiency of scheduling and recovering tasks. Adding less time redundancy means that more tasks can be scheduled and utilization is higher, but fewer tasks can be recovered after a fault. Adding more time redundancy makes recovery easier but lowers utilization. Hence, it is important to find an efficient value of time redundancy. For the fault model considered in this paper, in which only a single fault occurs per hyperperiod, the time redundancy can be taken to be the longest WCET in the task set. This allows recovery of the task with the longest WCET and of any other task with a shorter WCET.

D. Utilization

The utilization now consists of the sum of C_i / T_i over all tasks, with the ratio of the task with the longest WCET counted twice. Hence the schedulability bound is calculated as

    \sum_{i=1}^{n} C_i / T_i + C_{lng} / T_{lng} \le 1

where C_lng and T_lng are the WCET and the period (deadline) of the task with the longest WCET. The task set can be scheduled if its actual utilization satisfies U ≤ 1 - C_lng / T_lng. Here, C_lng / T_lng is the backup utilization U_B of the system. Therefore, following Liu and Layland [1], the schedulability bound for fault-tolerant EDF becomes 1 - U_B.

E. Typical Execution

Consider the task set, given as (C_i, T_i), T1 (1, 5), T2 (5, 7). This task set has an actual utilization of U = 1/5 + 5/7 ≈ 0.914 and so is schedulable by plain EDF. The schedule without fault tolerance is given in Fig. 1a. The tasks are represented as T_ij, where i is the task ID and j is the instance of the task. If a fault were to occur at time instant 12, the task T_12 can be re-executed at time 13 and still complete before its deadline at time 15. But if a fault were to occur while the longer task T_21 is executing, i.e., at time 5 as shown in Fig. 1a, the task has already executed 4 units and cannot re-execute without missing its deadline at time 7.

[Fig. 1 and Fig. 2]

Now consider a task set T1 (1, 5), T2 (4, 10), with the utilization calculated for fault-tolerant EDF: (1/5) + 2(4/10) = 1. The schedule for this task set is given in Fig. 2a. If a fault were to occur at time 3, as shown in Fig. 2b, the task T_21 can still re-execute before its deadline. Hence, it is necessary to provide sufficient slack so that the longest task can be re-executed. Since the fault model allows only one fault per hyperperiod, only one task will have to be re-executed.

IV. PROPOSED ALGORITHM

A. Description

The proposed algorithm combines the principles of Round Robin scheduling and shortest-remaining-time-first scheduling.

Shortest remaining time first (SRTF) is a scheduling method in which the process with the smallest amount of time remaining until completion is selected to execute. Since the currently executing process is by definition the one with the shortest remaining time, and that time only decreases as execution progresses, a process always runs until it completes or until a new process arrives that requires less time. SRTF is advantageous because short processes are handled very quickly. It also requires very little overhead, since it makes a decision only when a process completes or a new process is added, and when a new process is added it only needs to compare the currently executing process with the new one, ignoring all other waiting processes.

In Round Robin (RR), time slices are assigned to each process in equal portions and in circular order. To schedule processes fairly, a round-robin scheduler generally employs time sharing, giving each job a time slot or quantum and interrupting the job if it has not completed by the end of the slot. The job is resumed the next time a time slot is assigned to that process.

To increase the utilization of the task set and allow more tasks to be scheduled, the Round Robin and shortest-remaining-time algorithms are combined in the following way. Consider a task set of the form (C_i, T_i), where C_i is the computation time and T_i is the period of the task. The tasks are sorted in increasing order of their remaining time to deadline. The RR time slice is taken as the WCET of the shortest job. Each task executes in RR fashion with the chosen time slice unless it is preempted by a task with high urgency. A task j is urgent when

    WCET_j - CEU_j + TIME - T_j = 0

where WCET_j is the worst-case execution time, CEU_j is the number of units of the task executed so far, T_j is the deadline, and TIME is the current time instant. In other words, the remaining execution time of the task equals the time left until its deadline. The task that satisfies this urgency condition is given the highest priority and preempts any other task currently executing.

B. Fault Tolerance in the Algorithm

Fault tolerance against transient faults is achieved by re-executing the faulty task. For every fault occurring in an RR time slice, the entire time slice is re-executed. The urgency of the tasks is determined dynamically, i.e., the most urgent task has the highest priority. When a fault occurs, the system goes into fault recovery, where the task that already executed is added back to the schedule as if it had just arrived and the urgency factors of all tasks are recalculated. The tasks are again scheduled dynamically in each time slice, so the task with the highest urgency executes.

C. Checkpointing Overhead

The proposed algorithm assumes that a fault can be eliminated by re-executing only the affected RR time slice, because a checkpoint is taken at every RR slice boundary. Only one task executes during a single RR slice. Hence, the checkpoint overhead is assumed to be the cost of saving the state of a single task during an RR time slice. If a task completes execution during an RR time slice, there is no need to save its state. An arbitrary factor of 0.05 is used to calculate the checkpoint overhead. This additional overhead is added to the total utilization before schedulability is determined.

D. Time Redundancy in the Algorithm

As in the EDF scheme, an efficient value of time redundancy is added to the proposed algorithm. For EDF, the longest WCET in the task set was chosen as the slack needed for re-execution. In this algorithm, the shortest WCET is chosen. Since the algorithm works in time slices, where each RR time slice equals the shortest WCET of any task, only the time slice of the faulty task has to be re-executed when a fault occurs, thanks to the checkpointing described above. Therefore, it is sufficient to add only the shortest WCET as time redundancy.

E. Utilization

The utilization is obtained by adding the C_i / T_i ratios of all tasks and the checkpoint overhead. The ratio of the task with the shortest WCET is counted twice to provide enough slack to achieve fault tolerance.
Hence, writing U = \sum_{i=1}^{n} C_i / T_i for the actual utilization, the schedulability bound becomes

    U + C_{sh} / T_{sh} + CP_{overhead} \le 1

where C_sh / T_sh + CP overhead is the backup utilization U_B of the system, and C_sh and T_sh are the WCET and period of the task with the shortest WCET. The task set can be scheduled if the actual utilization of the system satisfies U ≤ 1 - (C_sh / T_sh + CP overhead). The schedulability bound for the fault-tolerant algorithm is 1 - U_B.

F. Typical Execution

For example, consider the task set, given as (C_i, T_i), T1 (2, 8), T2 (5, 10). The actual utilization of the task set is U = 0.75, so it is schedulable.

[Fig. 2]

The schedule produced by the proposed algorithm without any fault is shown in Fig. 2a. The tasks are sorted by remaining time to deadline; T2 has the shortest and is therefore chosen to execute first. The task with the shortest WCET is T1, with C_i = 2, so the RR time slice is 2 units. T2 is executed for 2 units and T1 is executed for the next 2 units. Then T2 is executed again for 2 units. In the next time slice, since no task other than T2 is available, T2 is executed to completion. If a fault were to occur at time 5, as shown in Fig. 2b, then only the RR time slice of T2 has to be re-executed. The EDF schedule for the same task set is shown in Fig. 2c. It can be seen from the figure that if a fault occurred at time instant 5, the entire task T2, with a computation time of 5 units, would have to be re-executed.

V. SIMULATION

Both algorithms have been implemented in the C++ programming language. An event-driven simulator was designed and implemented. The simulator maintains a task vector and handles events in order. It can execute either algorithm, depending on the input values given. The simulation works as follows:

* Scheduling of tasks: The Earliest-Deadline-First (EDF) scheduling policy or the proposed algorithm is used to schedule tasks.
* Fault injection: Faults are injected into the schedule while tasks are running.
  Faults are generated based on a value of the mean time to failure (MTTF) specified by the user.
* Fault recovery: The recovery scheme used here is re-execution, for tolerating transient faults.

Determine the utilization and schedulability.
If schedulable:
    Create an event queue by ordering the tasks.
    Start time at 0.
    Loop while (time < simulation end time):
        Execute the task every time instant for EDF, or every RR time slice for the proposed algorithm.
        At every step (instant for EDF, slice for the proposed algorithm), order the tasks based on priority or urgency as explained above.
        Check for faults at the end of every time instant.
        If a fault has occurred, re-execute the entire task for EDF, or re-execute the time slice for the proposed algorithm.
        Use time redundancy as the recovery mechanism and re-execute the faulty task by adding it to the queue again.

VI. RESULTS

The results are tabulated below. The scenarios investigated are as follows:

1) Completed execution with fault tolerance - FT
2) Missed deadline due to faults - MD
3) Unable to schedule - UTS

All faults are injected at random time intervals by the simulator. Rows marked EDF refer to fault-tolerant EDF; all other rows refer to the proposed algorithm. Task sets are listed as (task ID, C_i, T_i).

[Fig. 3 and Fig. 4]

| Scenario name | Actual utilization | Utilization | Checkpoint overhead | Fault model (faults per LCM) | Task set | Execution details | Simulation time (LCM) |
| FT | 0.438 | 0.568 | 0.025 | Single | (1,6,18), (2,2,19) | With a utilization of 50%, the new algorithm is fault tolerant with the fault occurring at any instant during the execution. | 342 |
| FT EDF | 0.438 | 0.771 | - | Single | (1,6,18), (2,2,19) | EDF is fault tolerant when the fault occurs at a time when recovery is possible (enough time redundancy before the deadline to allow re-execution), since the utilization is 70%. | 342 |
| FT | 0.679 | 0.98 | 0.025 | Double | (1,2,7), (2,5,22), (3,4,24) | The proposed algorithm is fault tolerant with double faults in the LCM. With two faults occurring at times 5 and 15, the algorithm is fault tolerant even though the utilization is 98%. This shows that the ratio C_i / T_i also plays a significant role in determining whether the execution of the task set will be fault tolerant. (See Fig. 3a) | 1848 |
| FT EDF | 0.679 | 0.906 | - | Double | (1,2,7), (2,5,22), (3,4,24) | EDF is fault tolerant with double faults in the LCM. With two faults occurring at times 5 and 15, the algorithm is fault tolerant, but the tasks finish closer to their deadlines because the whole task is re-executed. (See Fig. 3b) | 1848 |
| FT | 0.623 | 0.8166 | 0.05 | Double | (1,1,7), (2,6,22), (3,5,24) | In Fig. 4a, with faults occurring at time instants 3 and 13, the proposed algorithm had enough slack to re-execute the time slice and still meet the deadline. | 1848 |
| MD EDF | 0.623 | 0.89 | - | Double | (1,1,7), (2,6,22), (3,5,24) | In Fig. 4b, with faults occurring at time instants 3 and 13, T3 missed its deadline because T2 was re-executed twice. | 1848 |
| UTS | 0.9 | 1.05 | 0.05 | Single | (1,1,5), (2,1,5), (3,1,5), (4,1,5) | Unable to schedule the task set because of the high checkpoint overhead involved. | 5 |
| FT EDF | 0.9 | 1 | - | Single | (1,1,5), (2,1,5), (3,1,5), (4,1,5) | This task set is schedulable by EDF, though it failed for the proposed algorithm because of the extra overhead. | 5 |
| UTS | 0.79 | 1.11991 | - | Single | (1,10,15), (2,7,25), (3,6,19) | Unable to schedule the task set because the ratio C_i / T_i is very small for the task with the minimum WCET. | 950 |
| MD EDF | 0.79 | 0.995789 | - | Single | (1,10,15), (2,7,25), (3,6,19) | This task set is schedulable by EDF, but for a fault occurring at time 9, tasks 1 and 3 miss their deadlines. | 950 |
| FT | 0.83333 | 0.991667 | 0.025 | Single | (1,8,16), (2,2,15), (3,7,35) | For a fault occurring at the start of execution, at time 2, the RR time slice is re-executed for task 1, which does not miss its deadline and is fault tolerant. | 1200 |
| MD | 0.83333 | 0.991667 | 0.025 | Single | (1,8,16), (2,2,15), (3,7,35) | Task 1 missed its deadline due to the fault encountered at time 15. The RR time slice is 2, and the task does not have enough slack to re-execute the RR time slice. | 1200 |
| UTS EDF | 0.83333 | 1.33333 | - | Single | (1,8,16), (2,2,15), (3,7,35) | Unable to schedule with EDF because twice the C_i / T_i of the task with the longest WCET is itself 1. | 1200 |
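The utilization columns reported above can be re-derived with a short Python sketch. It rests on assumptions inferred from the text rather than on the authors' code: fault-tolerant EDF adds C/T of the task with the longest WCET to the actual utilization, the proposed algorithm adds C/T of the task with the shortest WCET plus the checkpoint overhead, and a task set is considered schedulable when the result does not exceed 1. The checkpoint overhead is passed in as a number taken from the table, and all function names are hypothetical.

```python
def actual_utilization(tasks):
    """Sum of C_i / T_i over all tasks; tasks is a list of (C_i, T_i) pairs."""
    return sum(c / t for c, t in tasks)


def ft_edf_utilization(tasks):
    """Actual utilization plus backup utilization of the longest-WCET task (assumed form)."""
    c_lng, t_lng = max(tasks, key=lambda ct: ct[0])
    return actual_utilization(tasks) + c_lng / t_lng


def proposed_utilization(tasks, cp_overhead):
    """Actual utilization plus backup of the shortest-WCET task and checkpoint overhead."""
    c_sh, t_sh = min(tasks, key=lambda ct: ct[0])
    return actual_utilization(tasks) + c_sh / t_sh + cp_overhead


def schedulable(utilization):
    """Utilization-based test: the task set is accepted when the total is at most 1."""
    return utilization <= 1.0


# Task set (1,8,16), (2,2,15), (3,7,35) from the table, without the task IDs.
tasks = [(8, 16), (2, 15), (7, 35)]
print(round(actual_utilization(tasks), 5))            # 0.83333
print(round(proposed_utilization(tasks, 0.025), 6))   # 0.991667
print(round(ft_edf_utilization(tasks), 5))            # 1.33333
print(schedulable(proposed_utilization(tasks, 0.025)),
      schedulable(ft_edf_utilization(tasks)))         # True False
```

For this task set the sketch agrees with the table: the proposed algorithm stays just below the bound, while fault-tolerant EDF exceeds it because the longest task alone accounts for half of the processor.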

VII. CONCLUSIONS

We used a diverse set of tasks to compare the functionality of EDF with that of the proposed algorithm. The results are summarized as follows:

For fault-tolerant EDF, the slack is the WCET of the longest-executing task. This slack is enough to handle any fault that occurs during other tasks, as long as they are not close to their deadlines. If a fault occurs right before the deadline, the task will miss its deadline.

The proposed algorithm works well when the ratio between C_i and T_i is large enough. However, the checkpointing overhead may prevent it from scheduling task sets that are schedulable by EDF. The proposed algorithm can guarantee fault tolerance as long as the fault occurs at least one RR slice of execution before the task's deadline.

REFERENCES

[1] C. Liu and J. Layland, "Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment", Journal of the ACM, vol. 20, no. 1, pp. 46-61, January 1973.
[2] Hakem Beitollahi, Seyed Ghassem Miremadi and Geert Deconinck, "Fault-Tolerant Earliest-Deadline-First Scheduling", IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007.
[3] Ghosh, R. Melhem and D. Mosse, "Fault-Tolerant Rate Monotonic Scheduling", Journal of Real Time Systems, 15(2): 149-181, Sept 1998.
[4] Hakem Beitollahi and Geert Deconinck, "Fault-Tolerant Rate-Monotonic Scheduling in Uniprocessor Embedded Systems", 12th Pacific Rim International Symposium on Dependable Computing (PRDC'06).
[5] A. Christy Persya and T. R. Gopalakrishnan Nair, "Fault Tolerant Real Time Systems", International Conference on Managing Next Generation Software Application (MNGSA-08), Coimbatore, 2008.