Task Synchronization and Allocation for Many-Core Real-Time Systems


Pi-Cheng Hsiu 1,2, Der-Nien Lee 3, and Tei-Wei Kuo 1,3,4
1 Research Center for Information Technology Innovation, Academia Sinica
2 Institute of Information Science, Academia Sinica
3 Department of Computer Science and Information Engineering, National Taiwan University
4 Graduate Institute of Networking and Multimedia, National Taiwan University
Taipei, Taiwan, R.O.C.
pchsiu@citi.sinica.edu.tw, r97036@csie.ntu.edu.tw, ktw@csie.ntu.edu.tw

ABSTRACT
With the emergence of many-core systems, managing blocking costs effectively will soon become a critical issue in the design of real-time systems. In contrast to previous work on multi-core real-time task scheduling algorithms and synchronization protocols, this paper proposes a dedicated-core framework that separates the execution of application tasks and (system) services over cores so that blocking among tasks can be better analyzed and managed. The rationale behind the framework is that we can exploit the characteristics of many-core systems to resolve the challenges raised by the systems themselves. We define three core minimization problems with respect to the constraints on core configurations, and present corresponding task allocation algorithms with optimal, approximate, and heuristic solutions. The results of simulations conducted to evaluate the proposed framework provide further insights into task scheduling in many-core real-time systems.

Categories and Subject Descriptors
D.4.1 [Operating Systems]: Process Management - Synchronization; D.4.7 [Operating Systems]: Organization and Design - Real-Time Systems and Embedded Systems

General Terms
Algorithms, Design, Management, Performance

Keywords
Many-core systems, real-time scheduling, task synchronization, task allocation
1. INTRODUCTION
As multi-core systems have received a great deal of attention in recent years, system engineers will soon face major challenges in designing systems for the emerging genre of many-core systems. The pressure to develop appropriate system design methodologies will escalate rapidly because of market predictions about the demand for computer systems with tens of cores in the next decade [1]. For example, a recent study [7] posited that contention for data structures will cause the time required to complete an application program to increase from 5% on seven cores to almost 30% on 16 cores. Blocking due to resource sharing among tasks over multiple cores could easily leave cores in the idle state and degrade system performance, thereby impacting the schedulability of real-time tasks and offsetting the computing benefits provided by multiple cores. This observation motivates us to explore solutions to the task synchronization and allocation problems in many-core real-time systems. Task scheduling is a critical area in the design of real-time systems, and many excellent fixed and dynamic priority-driven scheduling algorithms have been proposed for independent tasks [, 17, 18, 1]. However, tasks that involve resource sharing may give rise to priority inversions and incur unnecessary blocking costs as a consequence. To resolve the problem, a number of synchronization protocols have been developed [3, 9, 7].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EMSOFT'11, October 9-14, 2011, Taipei, Taiwan. Copyright 2011 ACM.
In particular, the priority ceiling protocol (PCP) [7] proposed the concepts of priority ceilings and priority inheritance to manage priority inversions and deadlocks under the well-known rate-monotonic priority assignment scheme [18]. Subsequently, the stack-based resource allocation protocol (SRP) [3] was introduced to deal with dynamic priority assignments and instances of multiple resources. The advent of multi-core systems has made real-time task scheduling and synchronization significantly more complicated. In the last decade, several multi-core real-time scheduling algorithms have been proposed for independent tasks under global scheduling with dynamic task migration [5, 13, 14] and partitioned scheduling with static task allocation [4, 4, 6]. For tasks that involve resource sharing, the concept of priority ceilings was extended by the multiprocessor priority ceiling protocol (MPCP) [5]. Under MPCP, resources shared by tasks on different processors are guarded by global semaphores in order to manage remote blocking and priority inversions. The flexible multiprocessor locking protocol (FMLP) [6] implements blocking by spinning (resp. suspension) for short (resp. long) critical sections so as to reduce potential remote blocking and priority inversions. The concept of virtual resources was proposed so that shared resources can be partitioned into virtual resources that are accessible by individual task groups [], and effective blocking-aware heuristics in the presence of resource constraints were proposed in [10, 11]. More recently, an excellent study explored asymptotically optimal blocking behavior

for tasks based on the number of priority inversions in multiprocessor scheduling [8]. Real-time task allocation is closely related to core minimization, especially in partitioned scheduling, where tasks are not allowed to migrate after allocation. Task allocation algorithms are developed to minimize the number of cores without violating timing constraints. It has been proved that the core minimization problem is NP-hard in the strong sense, even for independent tasks [15]. For applications in which resource sharing is not considered, various bin-packing-based heuristics have been proposed [19, 0, 3]. However, there has been comparatively little research on the core minimization problem when resource sharing is considered. The synchronization-aware task allocation algorithm [16], which relies on the well-known MPCP [5], is a closely related approach. It groups and runs tasks that access a shared resource on the same core in order to transform global resource sharing into local resource sharing. However, a group could become too large to run on one core as the number of tasks increases and resource sharing among tasks becomes more complex. The approach also reflects the difficulty of real-time task allocation and core minimization when task synchronization is considered. The interplay between task allocation, task preemption, and blocking costs makes the core minimization problem extremely difficult to solve in many-core systems. In this work, we explore the strategy of executing application tasks and their (system) services separately so that blocking among tasks can be better managed, and the number of cores required to satisfy timing constraints can be reduced. The objective is to develop a methodology (and a scheduling framework) to manage transitive and direct blocking and preemption costs among tasks with respect to their schedulability considerations.
To this end, we dedicate a set of cores, called service cores, to running service tasks that provide the common services requested by application tasks running on application cores, and provide interactivity between application tasks and service tasks via an RPC-like mechanism 1. We define three core minimization problems with respect to task allocation. First, we propose an optimal algorithm for service core minimization when the application core configuration is given. We then propose approximation algorithms for application core minimization when the service core configuration is given. Finally, based on these algorithms, we develop a heuristic algorithm to minimize the total number of application and service cores without violating timing constraints. The results of simulations conducted to evaluate the proposed dedicated-core framework are encouraging in terms of the minimum number of cores required and the core utilization. The remainder of this paper is organized as follows: Section 2 describes the system model and provides the research motivation. In Section 3, we discuss the dedicated-core framework and define the three core minimization problems. We also introduce an RPC-like mechanism to model task synchronization, and propose task allocation algorithms to solve the defined problems. In Section 4, we report the results of simulations conducted to evaluate the performance of the proposed framework. Section 5 contains some concluding remarks.

1 For the code/data present in the shared memory, the migration overhead is relatively light. However, there remains code/data (present in the private memory or cache) that may be used by another task inside its critical section. In fact, this is also true for traditional multi-core systems when tasks running on different cores share some resources. In this paper, we do not consider the memory/cache hierarchy, which is another highly complicated dimension in many-core real-time task scheduling.
2. SYSTEM MODEL AND PROBLEM DEFINITION
In this paper, we focus on task synchronization and allocation in real-time many-core systems. Every task may have code segments, called critical sections, containing shared resources, such as variables or data structures, that can be accessed or manipulated by other tasks. Suppose the shared resources are protected by semaphores to ensure the mutual exclusion of execution in critical sections. With multitasking support, tasks can be blocked locally and/or remotely on cores because of task synchronization. This raises the following technical question: how can priority inversions be managed when a number of tasks competing for the computing cycles of multiple cores block one another, directly or indirectly, during their execution? Before discussing this issue in detail, we present the system model. We consider a set of real-time periodic tasks running in a many-core system. Let each task be associated with a time period and a worst-case execution time, and let the relative deadline of each task be equal to its period. Suppose each task can lock a set of semaphores for at most a specific amount of time during its execution. Two semaphore-locking activities overlap if their locking intervals overlap. Note that blocking may occur when tasks attempt to lock the same semaphore or compete for the computing cycles of the same core.

Figure 1: The interplay between task allocation, task preemption, and blocking costs

The motivation for this research can be better illustrated by the example in Figure 1. Let a task τ2 with the middle priority arrive and execute on the first core when both cores are free. Suppose a task τ3 with the lowest priority arrives later on the second core and shares a common semaphore S with τ2.
Because τ3 locked S successfully at time t0, τ2's lock request to S is blocked (remotely) at time t1; therefore, τ2 must wait until S is released. However, τ3 is then preempted by a task τ1, which has the highest priority. As a result, τ2 is forced to wait even longer because of the preemption of τ3 by τ1. Let τ3 resume after τ1 terminates at time t3, and release S at time t4. Now, τ2 can lock S and continue its execution on the first core. Clearly, the above task allocation is not desirable because the preemption cost of τ1 on τ3 is passed on to τ2 indirectly along with the blocking cost of τ3 on τ2. Moreover, as the first core is idle for a long time, τ2 might miss its deadline. A preferable

task allocation strategy would be to allow τ1 and τ2 to execute on the first core, leaving the second core free for τ3 so that it could release S earlier. Importantly, τ2 can avoid the transitive preemption cost of τ1 on τ3. Blocking among a large number of tasks could become extremely complex, and cores could easily be left in the idle state if the tasks are not allocated to cores appropriately. This in turn would degrade the system performance and offset the benefits of utilizing many cores. The situation is likely to become even more serious because the number of cores in a system is expected to increase in the coming years. To resolve the problem, we explore how many cores are required for a set of real-time tasks to ensure that no deadline is violated. This objective is to help system developers assess the blocking costs in real-time applications. We define the target problem formally as follows:

Problem Definition 1: The Core Minimization Problem
Instance: Consider a set of real-time periodic tasks Ψ = {τ1, τ2, ..., τN}. Each task τi is associated with its period/relative deadline pi, its worst-case execution time ci, and a set of semaphores Si = {si1, si2, ..., sij}, where each sik may be locked by τi for at most tik time units.
Objective: The objective is to allocate the task set Ψ to cores such that no deadline is violated, and the number of cores is minimized.

3. MANY-CORE REAL-TIME TASK SYNCHRONIZATION AND ALLOCATION
3.1 Task Synchronization
There is interplay between task synchronization and task allocation, and different allocation strategies may incur different preemption and blocking costs, as shown in the example in Figure 1. Complicated inter-core/intra-core blocking/preemption scenarios make the core minimization problem extremely difficult to resolve. In this section, we propose a dedicated-core framework.
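To make the problem instance concrete, the task model of Problem Definition 1 can be sketched in a few lines of code. This is an illustrative sketch: the class, field names, and all parameter values are our assumptions, not the paper's.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Task:
    """A periodic task of Problem Definition 1 (names are illustrative)."""
    name: str
    period: float                 # p_i; the relative deadline equals p_i
    wcet: float                   # c_i, worst-case execution time
    lock_times: dict = field(default_factory=dict)   # s_ik -> t_ik

# Three tasks in the spirit of the Figure 1 example (made-up numbers);
# tau2 and tau3 share semaphore S.
tasks = [
    Task("tau1", period=10, wcet=3),
    Task("tau2", period=20, wcet=5, lock_times={"S": 2}),
    Task("tau3", period=40, wcet=8, lock_times={"S": 2}),
]

# Total utilization gives a quick lower bound on the core count;
# blocking and preemption costs can push the real answer higher.
total_util = sum(t.wcet / t.period for t in tasks)
print(math.ceil(total_util))   # prints: 1
```

The utilization bound ignores exactly the blocking interplay illustrated by Figure 1, which is what makes the actual minimization problem hard.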
The rationale behind the framework is to better utilize the potential of a large number of cores to mitigate the effect of blocking costs on task allocation strategies, and thereby resolve the core minimization problem in a systematic fashion.

3.1.1 A Dedicated-Core Framework and Problem Redefinition
The framework contains a set of cores, called service cores, dedicated to running service tasks for the provision of common services; for example, the protection of code executed in a critical section is modeled as a service task. All application tasks, i.e., the real-time tasks Ψ of the problem defined in Section 2, run on the remaining cores, called application cores. An example of the dedicated-core framework is illustrated in Figure 2.

Figure 2: An example of the dedicated-core framework

Before presenting a synchronization protocol for application and service tasks, we redefine our problems because the system model now considers dedicated cores. Based on the dedicated-core framework, we first resolve two simplified problems: (1) given an application core configuration, the objective is to find a set of service tasks running on a minimized collection of service cores; and (2) given a service core configuration, the objective is to allocate a set of application tasks to a minimized collection of application cores. After developing algorithms for the two problems, we use them to resolve the following redefined target problem with dedicated cores: given a set of application tasks, the objective is to find a set of service tasks and minimize the number of cores required for the application and service tasks. Note that because dedicated cores are introduced, the core minimization problems are no longer equivalent to the problem defined in Section 2. Next, we formally define the three problems and show that, unfortunately, they all remain NP-hard.
Problem Definition 2: The Service Core Minimization Problem
Instance: Consider a set of real-time periodic application tasks Ψ^A = {τ1^A, τ2^A, ..., τN^A}. Each task τi^A is associated with its period/relative deadline pi, its worst-case execution time ci, and a set of semaphores Si = {si1, si2, ..., sij}, where each sik may be locked by τi^A for at most tik time units. We assume that each application task has been pre-allocated to a certain core.
Objective: The objective is to find a set of service tasks Ψ^S = {τ1^S, τ2^S, ..., τK^S} and their allocation to the service cores such that the critical sections guarded by each semaphore sik ∈ Si, for all i, are serviced by a corresponding service task. No deadline is violated, and the number of service cores is minimized.

Theorem 1. The service core minimization problem is NP-hard.

Proof. This theorem can be proved by a reduction from the bin packing problem [1], which involves packing N items of various sizes into fixed-size bins such that the number of bins needed is minimized. Each item in the bin packing problem is transformed into an application task in our problem: each application task is associated with its period/relative deadline, which is equal to the bin size; its worst-case execution time, which is equal to the size of the corresponding item; and one semaphore that can be locked for the task's entire execution time. All the application tasks lock different semaphores; hence, each semaphore corresponds to one item. The problem can be proved to be NP-hard (even for this simple, special case) by showing that the N items can be packed into Z bins if

2 The detailed proofs of the theorems and lemmas are omitted in this paper due to the space limitation.

and only if the N semaphores can be assigned to a set of service tasks running on Z service cores such that no application task violates its deadline.

Problem Definition 3: The Application Core Minimization Problem
Instance: Consider a set of real-time periodic application tasks Ψ^A = {τ1^A, τ2^A, ..., τN^A}. Each task τi^A is associated with its period/relative deadline pi, its worst-case execution time ci, and a set of semaphores Si = {si1, si2, ..., sij}, where each sik may be locked by τi^A for at most tik time units. Given a set of service tasks Ψ^S = {τ1^S, τ2^S, ..., τK^S} and their allocation to the service cores, the critical sections guarded by each semaphore sik ∈ Si, for all i, are serviced by a corresponding service task.
Objective: The objective is to allocate the set of application tasks Ψ^A to application cores such that no deadline is violated, and the number of application cores is minimized.

Theorem 2. The application core minimization problem is NP-hard.

Proof. The proof of this theorem is similar to that of Theorem 1, in that one application task corresponds to one item; however, semaphore locking is not necessary in the reduction. Since the bin size is equal to the common deadline of the application tasks, a solution with the minimum number of application cores yields a corresponding solution to the bin packing problem instance.

Problem Definition 4: The Core Minimization Problem with Dedicated Cores
Instance: Consider a set of real-time periodic application tasks Ψ^A = {τ1^A, τ2^A, ..., τN^A}. Each task τi^A is associated with its period/relative deadline pi, its worst-case execution time ci, and a set of semaphores Si = {si1, si2, ..., sij}, where each sik may be locked by τi^A for at most tik time units.
Objective: The objective is to find a set of service tasks Ψ^S = {τ1^S, τ2^S, ..., τK^S} and their allocation to service cores such that the critical sections guarded by each semaphore sik ∈ Si, for all i, are serviced by a corresponding service task.
No deadline is violated, and the total number of service and application cores is minimized.

Theorem 3. The core minimization problem with dedicated cores is NP-hard.

Proof. The correctness of this theorem follows directly from the fact that the previous two problems are restricted cases of this problem.

3.1.2 A Task Synchronization Protocol
Before presenting our solutions to the three core minimization problems, we introduce a protocol for the synchronization of application and service tasks in the dedicated-core framework. Under the protocol, service tasks process service requests from application tasks in an on-demand fashion. In other words, a service task is idle only if its service queue is empty. We propose using a Remote-Procedure-Call-like (RPC-like) mechanism to handle service requests and return the results to the application tasks, which are blocked and waiting for completion of their service requests. If an application task can lock more than one semaphore in an overlapping fashion, we assume that the corresponding critical sections are serviced by the same service task. Note that this assumption could lead to a large service task and reduce the parallelism of task executions. However, the RPC-like mechanism is not intended to be a replacement for semaphores, but it could be a valid alternative for real-time tasks if programmers keep this restriction in mind. If necessary, programmers could still employ semaphores for critical sections that overlap in a complex fashion, and the corresponding schedulability analysis could be incorporated into the analysis under the proposed protocol.

Algorithm 1 A synchronization protocol for task execution under the dedicated-core framework
I Priority Assignment: The priorities of application tasks are assigned based on the rate-monotonic priority assignment. The priority of an application task's service request is set as the priority of the task.
The priority of a service task is set as the priority of the request it is currently processing.
II Service Request Handling: Each service request from an application task is added to the service queue of the corresponding service task. The application task then suspends itself until its service request is completed. Service requests in the same service queue are serviced in a non-preemptive manner based on their priorities.
III Task Scheduling: On each application core, the application task with the highest priority among ready application tasks is dispatched and scheduled in a preemptive manner. On each service core, the service task with the highest priority among ready service tasks is scheduled in a non-preemptive manner. A service task is idle when it does not have any service request to process.

The synchronization protocol is summarized in Algorithm 1. In this study, we adopt the rate-monotonic priority assignment (RM) for application tasks. RM gives a higher priority to a task with a smaller time period, and ties can be broken in an arbitrary fashion, such as by task identification [18]. To handle service requests, we utilize priority inheritance [7], following a rationale similar to that behind global semaphores to avoid uncontrolled blocking costs [5]. That is, a service task inherits the priority of the application task it is currently servicing, and the scheduling of service tasks and the processing of service requests are both performed in a non-preemptive fashion so that blocking costs can be bounded. The proposed protocol is illustrated by the example in Figure 3: Let application task τ2^A execute on the first application core, and application tasks τ1^A and τ3^A execute on the second application core. In addition, let τ2^A and τ3^A share a common semaphore to protect their critical sections, and create service task τ1^S to service those sections.
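The request-handling rules (parts I and II of Algorithm 1) can be sketched as a priority queue per service task. This is an illustrative model only: the class and method names are ours, and it captures just the priority ordering of pending requests, not suspension, RPC transport, or per-core scheduling.

```python
import heapq

class ServiceTask:
    """Sketch of rules I-II of Algorithm 1: pending service requests
    are ordered by priority (smaller value = higher priority, as
    under RM), each request runs to completion (non-preemptive),
    and the service task inherits the priority of the request it is
    currently processing."""

    def __init__(self):
        self._queue = []            # min-heap of (priority, seq, request)
        self._seq = 0               # FIFO tie-break for equal priorities
        self.current_priority = None

    def submit(self, priority, request):
        """An application task enqueues a request, then suspends."""
        heapq.heappush(self._queue, (priority, self._seq, request))
        self._seq += 1

    def serve_one(self):
        """Pick the highest-priority pending request; it completes
        before the next pending request is considered."""
        if not self._queue:
            self.current_priority = None    # idle: service queue empty
            return None
        prio, _, req = heapq.heappop(self._queue)
        self.current_priority = prio        # priority inheritance
        return req

s = ServiceTask()
s.submit(3, "critical section of tau3")     # lower-priority request
s.submit(2, "critical section of tau2")     # higher-priority request
print(s.serve_one())   # prints: critical section of tau2
print(s.serve_one())   # prints: critical section of tau3
```

Note that this models only the ordering of requests that are still pending; in the paper's Figure 3 example, τ3^A's request is served first because it was already in service when τ2^A's arrived and cannot be preempted.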
At time t0, τ3^A issues a service request to τ1^S via the RPC mechanism and then suspends itself until the RPC returns the result. The service task τ1^S starts the service with the priority of τ3^A at time t0. At time t1, τ2^A issues a service request to τ1^S and suspends itself. Since service requests are processed in a non-preemptive manner, τ1^S will not start processing the new request until time t2. As a result, the first application core remains idle until τ2^A's service

Figure 3: Task executions based on the dedicated-core framework

request is completed at time t3. At time t'2, τ1^A arrives and starts executing because it is the only task that is ready. On completion of τ3^A's service request at time t2, the task returns to the ready state. However, τ3^A is not scheduled at time t2 because its priority is lower than that of τ1^A.

3.1.3 Schedulability Analysis
Next, we provide a sufficient condition under which a set of real-time application tasks that use the synchronization protocol can be scheduled by the rate-monotonic algorithm. The schedulability test will be used in Section 3.2 to design the task allocation algorithms mentioned earlier. In the following schedulability analysis, local blocking refers to the phenomenon where a higher-priority application task is blocked by a lower-priority application task running on the same core. When a higher-priority application task is blocked because the corresponding service task is servicing a lower-priority application task, the former is said to be remotely blocked by the latter. On the other hand, a lower-priority application task suffers from remote preemption if the corresponding service task is servicing a higher-priority application task. Given a configuration of application and service tasks distributed over many cores, the schedulability of the tasks and their scheduling behavior are explained by the following lemmas and theorem.

Lemma 1. There are no local blocking costs among application tasks.

Proof. Critical sections are serviced by service tasks distributed over service cores, and an application core is always assigned to the ready application task with the highest priority.
Thus, no application task can be blocked by a lower-priority task on its host core.

Lemma 2. Suppose application task τi^A can issue at most γi service requests, and b_maxi is the longest period required to service any lower-priority request. Then, τi^A would be subject to a remote blocking cost of at most Bi = γi · b_maxi within one period.

Proof. The scheduling of service requests and service tasks is performed in a non-preemptive priority-driven fashion. Thus, any service request of τi^A can only wait for one lower-priority request, with a duration no longer than b_maxi, to be serviced.

Lemma 3. Let Hi^S ⊆ Ψ^A be the set of higher-priority application tasks that can issue requests to a common service core as τi^A; and let ĉ_{j,π} be the execution time of the critical sections of τj^A ∈ Hi^S on service core π. If Πi is the set of service cores that τi^A can issue requests to, τi^A would be subject to a remote preemption cost of at most Pi = Σ_{π∈Πi} Σ_{τj^A∈Hi^S} ⌈pi/pj⌉ · ĉ_{j,π} within one period.

Proof. Each higher-priority task τj^A ∈ Hi^S can arrive at most ⌈pi/pj⌉ times during one period of τi^A; and each arrival can be serviced by service tasks on each service core π ∈ Πi for a period no longer than ĉ_{j,π}.

In the dedicated-core framework, a higher-priority application task can be blocked from completing its service request and resume its execution later. As a result, a lower-priority application task might be subject to an extra preemption when a higher-priority application task resumes its execution after a blocked service request. Such an extra preemption cost is referred to as the resumption cost of the lower-priority application task.

Lemma 4. Let Hi^A ⊆ Ψ^A be the set of higher-priority application tasks allocated to the same application core as task τi^A; and let c̄j be the execution time of the non-critical sections of τj^A ∈ Hi^A. In one period, the resumption cost of τi^A is no more than Di = Σ_{τj^A∈Hi^A} c̄j.

Proof.
In one period, each higher-priority task τj^A ∈ Hi^A can contribute a resumption cost of at most c̄j to τi^A.

Lemma 5. An application task τi^A will not miss its deadline if

Σ_{τj^A∈Hi^A} cj/pj + (ci + Bi + Pi + Di)/pi ≤ h(2^{1/h} − 1),   (1)

where h is the number of application tasks in Hi^A ∪ {τi^A}.

Proof. This lemma follows directly from Liu and Layland's sufficient schedulability test [18], which regards remote blocking costs, remote preemption costs, and resumption costs as extra computation time for τi^A. (The proof is similar to that of the priority ceiling protocol [7].)

Theorem 4. A set of real-time periodic application tasks Ψ^A = {τ1^A, τ2^A, ..., τN^A} is schedulable if Equation (1) holds for each τi^A, where i = 1, 2, ..., N.

Proof. The proof of this theorem follows directly from that of Lemma 5.

3.2 Task Allocation
In the following, we utilize the schedulability test derived under the dedicated-core framework to develop an optimal prune-and-search algorithm for the service core minimization problem (Section 3.2.1), as well as two approximation algorithms for the application core minimization problem (Section 3.2.2). We then utilize the algorithms to resolve the core minimization problem with dedicated cores in an iterative fashion (Section 3.2.3).
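The sufficient test of Lemma 5 and Theorem 4 translates almost line-for-line into code. The sketch below assumes the costs Bi, Pi, and Di have already been computed per Lemmas 2-4; the function name and interface are illustrative, not the paper's.

```python
def rm_schedulable(i, tasks, B_i=0.0, P_i=0.0, D_i=0.0):
    """Sufficient RM test of Equation (1), as a sketch.
    tasks: list of (c, p) pairs on one application core, sorted by
    RM priority (index 0 = highest). B_i, P_i, D_i are task i's
    remote blocking, remote preemption, and resumption costs."""
    c_i, p_i = tasks[i]
    # Utilization of the higher-priority set H_i^A ...
    lhs = sum(c_j / p_j for c_j, p_j in tasks[:i])
    # ... plus task i's own demand, inflated by the extra costs.
    lhs += (c_i + B_i + P_i + D_i) / p_i
    h = i + 1                       # |H_i^A ∪ {τ_i^A}|
    return lhs <= h * (2.0 ** (1.0 / h) - 1.0)

# Two tasks on one core with no blocking: the classic bound applies.
core = [(1.0, 4.0), (2.0, 10.0)]
print(rm_schedulable(1, core))   # prints: True
```

With nonzero B_i, P_i, or D_i the left-hand side grows, so the same utilization that passes in isolation can fail once blocking is accounted for, which is exactly what the allocation algorithms of Section 3.2 must manage.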

3.2.1 Service Core Minimization
First, we resolve the service core minimization problem for a given application core configuration. Because the problem is known to be NP-hard, we try to derive good-quality solutions within a reasonable amount of time. Although the proposed algorithm is based on the rate-monotonic scheduling algorithm for task scheduling (and its schedulability analysis), we should point out that it could work with other scheduling algorithms. Before presenting the proposed algorithm, we define some terms. Consider a set of application tasks Ψ^A, in which each task τi^A has been allocated to a certain application core. Let Ψ^S = {τ1^S, τ2^S, ..., τK^S} be the set of service tasks for Ψ^A, where overlapping critical sections are serviced by a single service task. A configuration of a service task set is denoted as a set of subsets of service tasks, e.g., {(τ1^S, τ2^S), (τ3^S), (τ4^S, τ5^S)}, where service tasks in the same subset are allocated to the same service core, and the service task set is partitioned exclusively into the subsets of the configuration. A service core is monopolized if it only has one service task, such as a core for (τ3^S); otherwise, it is shared, such as one for (τ1^S, τ2^S). A k-configuration Ck contains exactly k shared service cores. Given a k-configuration Ck and an l-configuration Cl, the binary operator ⊕ applied to the two configurations yields another configuration Cm, where m = k + l, such that subsets of the shared cores in Ck or Cl stay in Cm; and every service task that does not belong to any subset of a shared core contributes a single-element subset. If any service task appears in more than one subset, the resulting configuration is invalid; otherwise, it is valid. A valid configuration is schedulable if all the application tasks are guaranteed to meet their deadlines (cf. Theorem 4).
A configuration set C_k is complete in terms of configurations with exactly k shared service cores (or k-configurations) if the set contains all schedulable k-configurations. Given two configuration sets C_k and C_l, the binary operator ⊕ on the two sets yields another configuration set C_k ⊕ C_l = {Ck ⊕ Cl | Ck ∈ C_k and Cl ∈ C_l}. Algorithm 2 shows the algorithm developed to solve the service core minimization problem. Consider a configuration of an application task set Ψ^A with K groups of semaphores, where semaphores that correspond to any overlapping critical sections are in the same group. The algorithm starts with a set of service tasks, each of which processes all the critical sections corresponding to one semaphore group (Line 1). Then, for the K service tasks, the algorithm tries to find a schedulable configuration with the minimum number of service cores (Lines 2-19). If the 0-configuration C0 for Ψ^S is not schedulable (according to Theorem 4), the algorithm terminates and reports the non-existence of a solution (Lines 2-3), since a schedulable configuration cannot exist (see Lemma 9). Otherwise, the algorithm sets C_1 as the set of all the schedulable 1-configurations for Ψ^S (Line 4). Note that the set can be created by considering all of the subsets of Ψ^S. If C_1 is empty, then C0 is the only schedulable solution, and it is returned (Lines 5-6). Otherwise, the algorithm uses C_1 and C_{i-1} to derive C_i, for i = 2, 3, ..., K (Lines 8-18). For each C_i (Lines 12-18), the algorithm identifies all the schedulable i-configurations in C_{i-1} ⊕ C_1 and uses them as the elements of C_i (Lines 13-14). In the for loop, the algorithm first checks whether the number of service cores of Cmin is not more than i (Lines 9-10). If this is confirmed, the algorithm breaks the loop because further exploration of C_{i+1} would not yield a configuration with fewer service cores than i. A similar check is performed by Lines 17 and 18.
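The combination step at the heart of the search can be sketched as a set operation. In this sketch the schedulability filter is omitted, all names are illustrative, and a configuration is modeled as a frozenset of shared-core groups (monopolized cores are left implicit):

```python
from itertools import combinations

def combine(Ck, Cl):
    """Sketch of the binary operator on configuration sets: pair
    every configuration in Ck with every one in Cl and keep only
    valid results, i.e., those in which no service task appears on
    two shared cores."""
    out = set()
    for ck in Ck:
        for cl in Cl:
            if set().union(*ck) & set().union(*cl):
                continue          # invalid: a task appears twice
            out.add(ck | cl)      # union of the two groups of shared cores
    return out

# All 1-configurations over four service tasks (the real algorithm
# would keep only the schedulable ones before combining).
S = ["s1", "s2", "s3", "s4"]
C1 = {frozenset([frozenset(g)])
      for r in range(2, len(S) + 1) for g in combinations(S, r)}
C2 = combine(C1, C1)
print(len(C1), len(C2))   # prints: 11 3
```

The rapid shrinkage from 11 candidate 1-configurations to 3 valid 2-configurations illustrates why pruning unschedulable configurations early keeps the search space manageable.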
Algorithm 2 The service core minimization algorithm
Input: A configuration of an application task set Ψ^A and K groups of semaphores, where semaphores that correspond to any overlapping critical sections are in the same group
Output: A schedulable configuration with a service task set Ψ^S such that the number of service cores is minimized
1: Let Ψ^S = {τ_1^S, τ_2^S, ..., τ_K^S} be a service task set in which each τ_g^S services all of the critical sections corresponding to the g-th group, g = 1, 2, ..., K
2: if C_0 for Ψ^S is not schedulable then
3:   return no solution
4: 𝒞_1 ← all of the schedulable 1-configurations for Ψ^S
5: if 𝒞_1 is empty then
6:   return C_0
7: C_min ← a configuration in 𝒞_1 with the minimum number of service cores
8: for i ← 2, 3, ..., K do
9:   if the number of cores of C_min is no more than i then
10:    return C_min
11:  𝒞_i ← ∅
12:  for all C_i ∈ 𝒞_{i-1} ⊕ 𝒞_1 do
13:    if C_i is a schedulable configuration then
14:      𝒞_i ← 𝒞_i ∪ {C_i}
15:      if the number of cores of C_i is less than that of C_min then
16:        C_min ← C_i
17:  if the number of cores of C_min is no more than i then
18:    return C_min
19: return C_min

Note that only the schedulable i-configurations in 𝒞_{i-1} ⊕ 𝒞_1 are included in 𝒞_i. As proved later by Lemma 9, if an i-configuration C_i is not schedulable, it follows that no (i+1)-configuration C_{i+1} = C_i ⊕ C_1 is schedulable. In this way, the algorithm removes infeasible i-configurations to avoid testing (i+1)-configurations that are impossible to schedule, and thereby reduces the search space. The algorithm always retains the configuration with the minimum number of service cores (Lines 15-16). When the for loop terminates, the best configuration C_min is returned (Line 19). Properties: In the remainder of this section, we analyze the time complexity of Algorithm 2 and prove its optimality for the service core minimization problem (with respect to Theorem 4).
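The bottom-up search of Algorithm 2 can be sketched in Python as follows; `is_schedulable` is a placeholder for the paper's schedulability test (Theorem 4), and the set representation of configurations is an assumption of this sketch:

```python
from itertools import combinations

def minimize_service_cores(tasks, is_schedulable):
    """Sketch of Algorithm 2: find a schedulable configuration of the
    service tasks with the fewest cores, or None if none exists.

    `tasks` is a set of service-task ids; `is_schedulable(config)` stands
    in for the schedulability test of Theorem 4. A configuration is a
    frozenset of frozensets; subsets of size > 1 are shared cores.
    """
    def with_singletons(shared):
        # Tasks on no shared core each get a monopolized core.
        used = {t for core in shared for t in core}
        return frozenset(set(shared) | {frozenset([t]) for t in tasks - used})

    c0 = with_singletons(set())                     # the 0-configuration
    if not is_schedulable(c0):
        return None                                 # no solution (cf. Lemma 9)

    # All schedulable 1-configurations: one shared core per subset (size >= 2).
    set1 = []
    for r in range(2, len(tasks) + 1):
        for sub in combinations(sorted(tasks), r):
            cfg = with_singletons({frozenset(sub)})
            if is_schedulable(cfg):
                set1.append(cfg)
    if not set1:
        return c0
    c_min = min(set1, key=len)                      # fewest cores so far
    prev = set1
    for i in range(2, len(tasks) + 1):
        if len(c_min) <= i:
            return c_min                            # cannot improve further
        cur = []
        for ca in prev:
            for cb in set1:                         # prev (+) set1
                shared = ({c for c in ca if len(c) > 1}
                          | {c for c in cb if len(c) > 1})
                covered = [t for core in shared for t in core]
                if len(covered) != len(set(covered)):
                    continue                        # invalid combination
                cfg = with_singletons(shared)
                if sum(1 for c in cfg if len(c) > 1) == i and is_schedulable(cfg):
                    cur.append(cfg)
                    if len(cfg) < len(c_min):
                        c_min = cfg
        prev = cur
    return c_min

# With a toy test limiting shared cores to two tasks each, four service
# tasks end up on two shared cores, e.g. {{1,2},{3,4}}.
best = minimize_service_cores({1, 2, 3, 4},
                              lambda cfg: all(len(c) <= 2 for c in cfg))
```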
Note that different, and possibly better, schedulability tests could be adopted by Algorithm 2, and the proof of optimality would follow accordingly.

Lemma 6. The time complexity of Algorithm 2 is O(B_K (2^K + KN^2)), where K and N are the numbers of service tasks and application tasks, respectively, and B_K = Σ_{i=0}^{K-1} C(K-1, i) B_i is the K-th Bell number, i.e., the number of possible partitions of a set with K elements.

Proof. The time complexity is dominated by the construction of the K complete configuration sets, including the construction of 𝒞_1, the execution of the binary operator ⊕ on 𝒞_{i-1} and 𝒞_1 for i = 2, 3, ..., K, and the identification of schedulable configurations. The time complexity is O(Σ_{i=1}^{K} |𝒞_{i-1} ⊕ 𝒞_1| + Σ_{i=1}^{K} |𝒞_i| · KN^2), where O(KN^2) is the time complexity of one schedulability test. Since |𝒞_1| < 2^K and Σ_{i=0}^{K} |𝒞_i| ≤ B_K, the proof follows.

Lemma 7. The minimum number of service cores needed when non-overlapping critical sections are processed by the same service task will not be less than the number required when the sections are processed by different service tasks.

Proof. Let s_1 and s_2 be two semaphores that guard non-overlapping critical sections. An allocation that is feasible when the corresponding critical sections of s_1 and s_2 are serviced by the same service task is also feasible when they are serviced by different service tasks, but not vice versa.

Lemma 8. If a configuration is schedulable, then moving any service task from a shared service core to a new monopolized service core will still result in a schedulable configuration.

Proof. This lemma follows from the fact that the remote preemption and blocking costs of any application task will not be increased by such a movement.

Lemma 9. If a k-configuration C_k is schedulable, then the configurations C_l and C_{k-l} are both schedulable, where C_k = C_l ⊕ C_{k-l}, for any 0 ≤ l ≤ k.

Proof. Let us consider any configuration C_l, where C_k = C_l ⊕ C_{k-l}. The l corresponding shared cores of C_k are preserved for C_l, and we give each service task on the remaining k-l shared cores a new monopolized service core. According to Lemma 8, C_l is schedulable.

Lemma 10. If Algorithm 2 leaves the loop with i = J, then a configuration set 𝒞_j is complete in terms of j-configurations, for any 1 ≤ j ≤ J-1.

Proof. This lemma can be proved by an induction on i. As the induction basis, 𝒞_1 is complete. According to Lemma 9, if an i-configuration C_i is schedulable, then C_{i-1} and C_1 are also schedulable. Since both 𝒞_{i-1} and 𝒞_1 are complete, C_i must be in 𝒞_{i-1} ⊕ 𝒞_1 and is thus included in 𝒞_i.

Theorem 5. If a schedulable configuration exists, Algorithm 2 always terminates and returns a schedulable configuration with the minimum number of service cores.

Proof. If the algorithm terminates at Line 3, there is no solution; and if it terminates at Line 6, there is no schedulable configuration with any shared service core, so the 0-configuration is the only solution.
Otherwise, an optimal configuration must be an i-configuration, for some 1 ≤ i ≤ K. If the algorithm does not break from the loop, then an optimal configuration must be found and returned at Line 19. If it breaks the loop at Line 10 or Line 18, then C_min is optimal because further exploration of 𝒞_{i+1} would not yield a configuration with fewer service cores than i.

3.2.2 Application Core Minimization

Next, we consider the application core minimization problem for a given service core configuration. Because the problem is known to be NP-hard, and the number of application tasks in a many-core system could be considerable, it might not be practical to develop an optimal algorithm whose time complexity is exponential in the number of application tasks. Therefore, we propose two algorithms: the rate-monotonic next-fit with core dedication (RMNF-CD) algorithm and the rate-monotonic first-fit with core dedication (RMFF-CD) algorithm, which are based on the next-fit and first-fit allocation strategies, respectively. Both strategies have been studied for multiprocessor task allocation and shown to have elegant properties [3, 6]. However, existing approaches only consider independent tasks (i.e., there is no resource sharing among tasks). In this section, we explain how to apply the next-fit and first-fit strategies to the task allocation problem when tasks share resources. We also provide an approximation bound for RMNF-CD and prove that RMFF-CD never uses more application cores than RMNF-CD. The simulation results reported in Section 4 show how much RMFF-CD outperforms RMNF-CD.
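The contrast between the next-fit and first-fit strategies underlying RMNF-CD and RMFF-CD can be sketched as follows; the `fits` callable stands in for the schedulability test of Lemma 5 and is an assumption of this sketch:

```python
def allocate(tasks, fits, first_fit=False):
    """Next-fit / first-fit allocation sketch (after RMNF-CD / RMFF-CD).

    `tasks` must already be sorted in decreasing priority order;
    `fits(core, task)` is a placeholder for the test of Lemma 5: it
    reports whether `task` can join `core` with all deadlines still met.
    Returns the list of cores, each a list of allocated tasks.
    """
    cores = [[]]
    for task in tasks:
        if first_fit:
            # First-fit: scan every existing core from the beginning.
            target = next((c for c in cores if fits(c, task)), None)
        else:
            # Next-fit: only the current (most recently opened) core.
            target = cores[-1] if fits(cores[-1], task) else None
        if target is None:                 # open a new application core
            target = []
            cores.append(target)
        target.append(task)
    return cores

# Toy illustration with a pure utilization test (not the paper's Lemma 5):
fits = lambda core, u: sum(core) + u <= 1.0
tasks = [0.6, 0.6, 0.3, 0.3]               # already in decreasing priority order
next_fit_cores = allocate(tasks, fits)                       # 3 cores
first_fit_cores = allocate(tasks, fits, first_fit=True)      # 2 cores
```

The toy run shows why first-fit never needs more cores: once next-fit abandons a core, the residual capacity on it is lost, while first-fit can still reuse it.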
Algorithm 3 RMNF-CD for application core minimization
Input: A configuration for a set of service tasks Ψ^S, and a set of application tasks Ψ^A = {τ_1^A, τ_2^A, ..., τ_N^A}
Output: A schedulable configuration for Ψ^A such that the number of application cores is minimized
1: Sort Ψ^A in decreasing order of task priority
2: j ← 1
3: C ← {α_1}
4: for all τ_i^A ∈ Ψ^A in decreasing order do
5:   if τ_i^A cannot meet its deadline on application core α_j then
6:     j ← j + 1
7:     C ← C ∪ {α_j}
8:   Allocate τ_i^A to run on application core α_j
9: return C

As shown in Algorithm 3, RMNF-CD first sorts the application tasks in decreasing order of priority. Then, for each application task τ_i^A ∈ Ψ^A, it checks whether τ_i^A is schedulable on the current application core α_j (by using the schedulability test in Lemma 5) after the allocation of the first i-1 application tasks is completed. If τ_i^A is schedulable, τ_i^A is allocated to α_j; otherwise, a new application core α_{j+1} is created for τ_i^A and considered as the current application core. The algorithm returns a schedulable configuration C when all the application tasks have been allocated. RMFF-CD is the same as RMNF-CD, except that it always checks from the beginning of the existing application cores and assigns τ_i^A to the first one on which τ_i^A is schedulable, instead of only checking the current application core. Note that, in both algorithms, the tasks are sorted in decreasing order of priority so that the inclusion of an application task does not affect the schedulability of the previously allocated higher-priority tasks. Properties: In the remainder of this section, we analyze the time complexity of the two algorithms and derive their approximation bounds.

Lemma 11. The time complexity of RMNF-CD is O(KN^2), where K and N are the numbers of service tasks and application tasks, respectively.

Proof. The time complexity is dominated by the time required to test whether each of the N application tasks can meet its deadline based on Lemma 5.
Each test takes O(N) time to compute Equation 1 if B_i and P_i, 1 ≤ i ≤ N, in the equation are pre-computed in O(KN^2) time. Thus, the algorithm takes O(N^2 + KN^2) time.

Lemma 12. The time complexity of RMFF-CD is O(N^3), where N is the number of application tasks.

Proof. RMFF-CD performs at most O(N) tests for each application task. Thus, it takes O(N^3 + KN^2) time.

We now derive an approximation bound for RMNF-CD and prove that RMFF-CD never yields more application cores than RMNF-CD; thus, the bound also applies to RMFF-CD. To derive the approximation bound, we define the expanded task set Ψ' of an application task set Ψ^A as follows: the period p'_i of a task τ'_i ∈ Ψ' is equal to the period p_i of the corresponding task τ_i^A ∈ Ψ^A, and the execution time c'_i of τ'_i is equal to the smaller of the period p'_i and the sum of the following: the execution time of the critical sections ĉ_i, the remote blocking cost B_i, the remote preemption cost P_i, and the execution time of the non-critical sections c̄_i of τ_i^A. Note that the tasks in Ψ' are independent of one another. Let the total utilization of Ψ' be denoted by U_{Ψ'} = Σ_{τ'_i ∈ Ψ'} c'_i / p'_i, and let the non-critical-section utilization of Ψ^A be denoted by Ū_{Ψ^A} = Σ_{τ_i^A ∈ Ψ^A} c̄_i / p_i, where c̄_i is the execution time of the non-critical sections of τ_i^A. Suppose the tasks in Ψ' are allocated to cores by RMNF [6], an algorithm that is the same as RMNF-CD except that Liu and Layland's schedulability test [18] is applied in Line 5 of Algorithm 3 instead. We now compare the number of application cores yielded by RMNF-CD for Ψ^A with the number yielded by RMNF for Ψ'.

Lemma 13. The number of application cores derived by RMNF-CD for an application task set Ψ^A is not more than the number derived by RMNF for the corresponding expanded task set Ψ'.

Proof. The lemma can be proved by showing, via an induction on i, that for 1 ≤ i ≤ N, if task τ_i^A is allocated to core α_j and task τ'_i is allocated to core α_{j'}, then j ≤ j'.

Lemma 14. Any set Ψ' of independent tasks can feasibly be allocated to run on n cores by RMNF if their total utilization U_{Ψ'} ≤ n(√2 - 1).

Proof. This lemma has been proved in [3].

Theorem 6. The number of application cores derived by RMNF-CD is not more than (√2 + 1) U_{Ψ'} / Ū_{Ψ^A} times the number derived by an optimal algorithm.
Proof. By Lemmas 13 and 14, the number of application cores derived by RMNF-CD for Ψ^A is not more than (√2 + 1) U_{Ψ'}. Since Ψ^A cannot be allocated to fewer than Ū_{Ψ^A} application cores by any algorithm, the proof follows.

Theorem 7. The number of application cores derived by RMFF-CD is not more than the number derived by RMNF-CD for the same application task set Ψ^A.

Proof. This theorem can be proved in a similar way to the proof of Lemma 13.

3.2.3 Core Minimization with Dedicated Cores

We resolve the core minimization problem with dedicated cores by solving the service core and application core minimization problems in an iterative manner. To this end, we propose the core minimization next-fit (CMNF) and core minimization first-fit (CMFF) algorithms. The only difference between them is the algorithm adopted for application core minimization, i.e., RMNF-CD or RMFF-CD. The steps of CMNF and CMFF are as follows. Given an application task set Ψ^A, a preliminary application core configuration is initialized (as described below). Next, the algorithms apply the service core minimization algorithm (presented in Section 3.2.1) to the initial application core configuration to find an optimal service core configuration. Then, an application core minimization algorithm (i.e., RMNF-CD or RMFF-CD) is applied to the derived service core configuration to find a better application core configuration. As the service core minimization algorithm is optimal, the process repeats until the application core minimization algorithm can no longer reduce the number of application cores required to execute the application tasks. The initial application core configuration can be chosen according to some design constraints or preferences. In this paper, we propose to initialize the application core configuration in a load-balancing manner such that the utilization of each application core is slightly lower than 41%.
The value is chosen based on the worst-case achievable utilization √2 - 1 ≈ 41% for independent tasks in multi-core systems, as reported in [3]. Because resource sharing among tasks could impact the schedulability in real-time systems, the utilization for our case should be slightly lower than that for independent tasks. Note that if the utilization value is set too high, some application tasks could miss their deadlines, so no feasible service core configuration would be found. On the other hand, if the utilization value is set too low, service tasks could be allocated to a small number of service cores. This could result in a situation where the remote preemption/blocking cost of application tasks would be significant, and the local preemption cost that application tasks could tolerate would be low. Consequently, further reduction of the number of application cores would be impossible, and the computing cycles of the application cores would be wasted.

4. PERFORMANCE EVALUATION

4.1 Simulation Setup and Performance Metrics

In this section, we evaluate the proposed algorithms, CMNF and CMFF, and provide insights into the design of many-core systems with dedicated cores. We use two performance metrics: (1) the number of cores used (which shows the capability of core minimization); and (2) the average core utilization achieved (which shows the efficiency of core usage). The simulations were conducted on a set of N real-time application tasks. Each task τ_i^A was associated with a utilization c_i / p_i generated with a normal distribution, where the mean and standard deviation were both set to the same fixed value. The ratio of the critical sections to the execution time (ĉ_i / c_i of τ_i^A), called the critical-section ratio, was normally distributed with mean Δ and standard deviation Δ × 75%. A set of K = N/5 service tasks were used to service the critical sections of the N application tasks, and the critical sections were matched with the K service tasks in a fair fashion.
The number of RPC requests that an application task issued to service tasks was selected uniformly in the range 1 to 3. In the simulations, the preliminary application core configuration was initialized in a load-balancing fashion such that the utilization of each application core was between 0.25 and 0.3. We investigated the impact of the number of application tasks N, varied from 20 to 70 (when Δ = 0.15), and the impact of the critical-section ratio Δ, varied from 0.05 to 0.2 (when N = 50). The derived results were the average values over 20 independent simulations.
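The workload generation described above can be sketched in Python; the clamping bounds, the period range, and the MEAN_UTIL/UTIL_STD constants are assumptions of this sketch (the transcription does not preserve the paper's utilization parameters):

```python
import random

MEAN_UTIL, UTIL_STD = 0.2, 0.2   # placeholder values, not from the paper

def generate_taskset(n, delta, seed=None):
    """Generate n application tasks following the simulation setup above."""
    rng = random.Random(seed)
    tasks = []
    for _ in range(n):
        # Task utilization c_i/p_i: normally distributed, clamped to (0, 1].
        util = min(max(rng.gauss(MEAN_UTIL, UTIL_STD), 0.01), 1.0)
        # Critical-section ratio: mean delta, std delta * 75%, clamped to [0, 1].
        ratio = min(max(rng.gauss(delta, delta * 0.75), 0.0), 1.0)
        period = rng.uniform(10, 100)           # assumed period range
        exec_time = util * period
        tasks.append({
            "period": period,
            "exec_time": exec_time,
            "critical": ratio * exec_time,      # critical-section demand
            "rpc_requests": rng.randint(1, 3),  # RPCs issued to service tasks
        })
    return tasks

taskset = generate_taskset(50, 0.15, seed=1)    # N = 50, Δ = 0.15
num_service_tasks = len(taskset) // 5           # K = N/5
```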

Figure 4: The number of cores: (a) Δ = 0.15; (b) N = 50. (Curves: TOTAL-NF, TOTAL-FF, APP-NF, APP-FF, SERVICE-NF, SERVICE-FF.)
Figure 5: Average core utilization: (a) Δ = 0.15; (b) N = 50. (Curves: APP-NF, APP-FF, SERVICE-NF, SERVICE-FF.)

4.2 Simulation Results

Figure 4(a) shows the impact of the number of application tasks on the number of cores derived by CMNF and CMFF. As expected, the total number of cores (i.e., TOTAL-NF and TOTAL-FF) increased as N increased. We observe that the two algorithms derived almost the same total number of cores. However, CMFF used fewer application cores and more service cores than CMNF. This is because the number of application cores derived by RMFF-CD is never more than the number derived by RMNF-CD for the same task set, as proved in Theorem 7. The results show that CMNF used 20% more application cores than CMFF, while CMFF used 44% more service cores than CMNF, when 20 ≤ N ≤ 70. Figure 4(b) shows the impact of the critical-section ratio on the number of cores derived by CMNF and CMFF. The total number of cores (i.e., TOTAL-NF and TOTAL-FF) increased as Δ increased. The increase was primarily due to the increase in the number of service cores, as the number of application cores only changed slightly. This is because the utilization of service tasks increased with Δ, so more service cores were needed. Although the utilization of application tasks decreased with Δ, the potential decrease in the number of application cores was offset by the higher remote blocking/preemption cost they would incur, as analyzed in Lemmas 2 and 3. We observe that, as the critical-section ratio increased, CMNF started to outperform CMFF when Δ = 15% in terms of the number of cores used.
This occurred because tasks were subject to further remote blocking/preemption costs as Δ increased, and RMFF-CD tended to use fewer application cores than RMNF-CD. Consequently, the number of service cores used under CMFF increased more quickly than under CMNF to compensate for more competition on the application cores. Moreover, for the same reason, CMFF used fewer application cores and more service cores than CMNF. For 5% ≤ Δ ≤ 17.5%, the results show that CMNF used 18% more application cores than CMFF, while CMFF used 45% more service cores than CMNF. Figure 5(a) shows the impact of the number of application tasks on the average core utilization under CMNF and CMFF. Clearly, the average utilization of application cores (i.e., APP-NF and APP-FF) did not change significantly with N. The reason is that, although the total application core utilization increased as the number of application/service tasks increased, the number of application cores also increased, as shown in Figure 4(a), so the utilization of each core remained similar. In contrast, service core utilization under CMNF and CMFF decreased with N, which seems to violate intuition because the total demand for services increased. The reason is that, as the total utilization of service cores gradually increased with N, more service cores were needed to meet the timing constraints of tasks, since more remote preemption/blocking costs could be imposed on lower-priority tasks. There was a hike as N increased from 10 to 20 because the number of service cores was still one before the break point. In addition, we observe that the utilization of application cores under CMFF was higher than under CMNF. This phenomenon was expected because, as mentioned earlier, CMFF used fewer application cores than CMNF. For a similar reason, the opposite phenomenon occurred in the utilization of service cores. The results show


More information

Response Time Analysis of Asynchronous Real-Time Systems

Response Time Analysis of Asynchronous Real-Time Systems Response Time Analysis of Asynchronous Real-Time Systems Guillem Bernat Real-Time Systems Research Group Department of Computer Science University of York York, YO10 5DD, UK Technical Report: YCS-2002-340

More information

Abort-Oriented Concurrency Control for Real-Time Databases

Abort-Oriented Concurrency Control for Real-Time Databases 660 IEEE TRANSACTIONS ON COMPUTERS, VOL. 50, NO. 7, JULY 2001 Abort-Oriented Concurrency Control for Real-Time Databases Tei-Wei Kuo, Member, IEEE, Ming-Chung Liang, and LihChyun Shu, Member, IEEE AbstractÐThere

More information

Efficient Prefix Computation on Faulty Hypercubes

Efficient Prefix Computation on Faulty Hypercubes JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 17, 1-21 (21) Efficient Prefix Computation on Faulty Hypercubes YU-WEI CHEN AND KUO-LIANG CHUNG + Department of Computer and Information Science Aletheia

More information

CS261: A Second Course in Algorithms Lecture #16: The Traveling Salesman Problem

CS261: A Second Course in Algorithms Lecture #16: The Traveling Salesman Problem CS61: A Second Course in Algorithms Lecture #16: The Traveling Salesman Problem Tim Roughgarden February 5, 016 1 The Traveling Salesman Problem (TSP) In this lecture we study a famous computational problem,

More information

Concurrent Programming Synchronisation. CISTER Summer Internship 2017

Concurrent Programming Synchronisation. CISTER Summer Internship 2017 1 Concurrent Programming Synchronisation CISTER Summer Internship 2017 Luís Nogueira lmn@isep.ipp.pt 2 Introduction Multitasking Concept of overlapping the computation of a program with another one Central

More information

T an important area of research in real-time computer systerns.

T an important area of research in real-time computer systerns. IEEE TRANSACTIONS ON COMPUTERS, VOL. 39, NO. 9, SEPTEMBER 1990 1175 Priority Inheritance Protocols: An Approach to Real-Time Synchronization Abstmct- A direct application of commonly used synchronization

More information

Integer Programming Theory

Integer Programming Theory Integer Programming Theory Laura Galli October 24, 2016 In the following we assume all functions are linear, hence we often drop the term linear. In discrete optimization, we seek to find a solution x

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 10 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Chapter 6: CPU Scheduling Basic Concepts

More information

Resource-bound process algebras for Schedulability and Performance Analysis of Real-Time and Embedded Systems

Resource-bound process algebras for Schedulability and Performance Analysis of Real-Time and Embedded Systems Resource-bound process algebras for Schedulability and Performance Analysis of Real-Time and Embedded Systems Insup Lee 1, Oleg Sokolsky 1, Anna Philippou 2 1 RTG (Real-Time Systems Group) Department of

More information

The Priority Ceiling Protocol: A Method for Minimizing the Blocking of High-Priority Ada Tasks

The Priority Ceiling Protocol: A Method for Minimizing the Blocking of High-Priority Ada Tasks Special Report CMU/SEI-88-SR-4 The Priority Ceiling Protocol: A Method for Minimizing the Blocking of High-Priority Ada Tasks John B. Goodenough Lui Sha March 1988 Special Report CMU/SEI-88-SR-4 March

More information

FUTURE communication networks are expected to support

FUTURE communication networks are expected to support 1146 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 13, NO 5, OCTOBER 2005 A Scalable Approach to the Partition of QoS Requirements in Unicast and Multicast Ariel Orda, Senior Member, IEEE, and Alexander Sprintson,

More information

CPU Scheduling. CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections )

CPU Scheduling. CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections ) CPU Scheduling CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections 6.7.2 6.8) 1 Contents Why Scheduling? Basic Concepts of Scheduling Scheduling Criteria A Basic Scheduling

More information

Resource Access Control in Real-Time Systems. Resource Access Control in Real-Time Systems

Resource Access Control in Real-Time Systems. Resource Access Control in Real-Time Systems CPSC663: RealTime Systems Resource Access Control in RealTime Systems Resources, Resource Access, and How Things Can Go Wrong: The Mars Pathfinder Incident Resources, Critical Sections, Blocking Priority

More information

A Flexible Multiprocessor Resource Sharing Framework for Ada

A Flexible Multiprocessor Resource Sharing Framework for Ada A Flexible Multiprocessor Resource Sharing Framework for Ada Shiyao Lin Submitted for the Degree of Doctor of Philosophy University of York Department of Computer Science September 2013 Abstract Lock-based

More information

Buffer Sizing to Reduce Interference and Increase Throughput of Real-Time Stream Processing Applications

Buffer Sizing to Reduce Interference and Increase Throughput of Real-Time Stream Processing Applications Buffer Sizing to Reduce Interference and Increase Throughput of Real-Time Stream Processing Applications Philip S. Wilmanns Stefan J. Geuns philip.wilmanns@utwente.nl stefan.geuns@utwente.nl University

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Mining Frequent Itemsets in Time-Varying Data Streams

Mining Frequent Itemsets in Time-Varying Data Streams Mining Frequent Itemsets in Time-Varying Data Streams Abstract A transactional data stream is an unbounded sequence of transactions continuously generated, usually at a high rate. Mining frequent itemsets

More information

A Capacity Sharing and Stealing Strategy for Open Real-time Systems

A Capacity Sharing and Stealing Strategy for Open Real-time Systems A Capacity Sharing and Stealing Strategy for Open Real-time Systems Luís Nogueira, Luís Miguel Pinho CISTER Research Centre School of Engineering of the Polytechnic Institute of Porto (ISEP/IPP) Rua Dr.

More information

Compositional Schedulability Analysis of Hierarchical Real-Time Systems

Compositional Schedulability Analysis of Hierarchical Real-Time Systems Compositional Schedulability Analysis of Hierarchical Real-Time Systems Arvind Easwaran, Insup Lee, Insik Shin, and Oleg Sokolsky Department of Computer and Information Science University of Pennsylvania,

More information

What s An OS? Cyclic Executive. Interrupts. Advantages Simple implementation Low overhead Very predictable

What s An OS? Cyclic Executive. Interrupts. Advantages Simple implementation Low overhead Very predictable What s An OS? Provides environment for executing programs Process abstraction for multitasking/concurrency scheduling Hardware abstraction layer (device drivers) File systems Communication Do we need an

More information

arxiv: v2 [cs.ds] 22 Jun 2016

arxiv: v2 [cs.ds] 22 Jun 2016 Federated Scheduling Admits No Constant Speedup Factors for Constrained-Deadline DAG Task Systems Jian-Jia Chen Department of Informatics, TU Dortmund University, Germany arxiv:1510.07254v2 [cs.ds] 22

More information

Event-Driven Scheduling. (closely following Jane Liu s Book)

Event-Driven Scheduling. (closely following Jane Liu s Book) Event-Driven Scheduling (closely following Jane Liu s Book) Real-Time Systems, 2006 Event-Driven Systems, 1 Principles Assign priorities to Jobs At events, jobs are scheduled according to their priorities

More information

Supporting Nested Locking in Multiprocessor Real-Time Systems

Supporting Nested Locking in Multiprocessor Real-Time Systems Supporting Nested Locking in Multiprocessor Real-Time Systems Bryan C. Ward and James H. Anderson Department of Computer Science, University of North Carolina at Chapel Hill Abstract This paper presents

More information

A Categorization of Real-time Multiprocessor. Scheduling Problems and Algorithms

A Categorization of Real-time Multiprocessor. Scheduling Problems and Algorithms A Categorization of Real-time Multiprocessor Scheduling Problems and Algorithms John Carpenter, Shelby Funk, Philip Holman, Anand Srinivasan, James Anderson, and Sanjoy Baruah Department of Computer Science,

More information

Comp Online Algorithms

Comp Online Algorithms Comp 7720 - Online Algorithms Notes 4: Bin Packing Shahin Kamalli University of Manitoba - Fall 208 December, 208 Introduction Bin packing is one of the fundamental problems in theory of computer science.

More information

Partitioned Scheduling of P-FRP in Symmetric Homogenous Multiprocessors *

Partitioned Scheduling of P-FRP in Symmetric Homogenous Multiprocessors * Partitioned Scheduling of P-FRP in Symmetric Homogenous Multiprocessors * Chaitanya Belwal, Albert M.K. Cheng Computer Science Department University of Houston Houston, TX, 77, USA http://www.cs.uh.edu

More information

Aperiodic Task Scheduling

Aperiodic Task Scheduling Aperiodic Task Scheduling Radek Pelánek Preemptive Scheduling: The Problem 1 processor arbitrary arrival times of tasks preemption performance measure: maximum lateness no resources, no precedence constraints

More information

Cluster scheduling for real-time systems: utilization bounds and run-time overhead

Cluster scheduling for real-time systems: utilization bounds and run-time overhead Real-Time Syst (2011) 47: 253 284 DOI 10.1007/s11241-011-9121-1 Cluster scheduling for real-time systems: utilization bounds and run-time overhead Xuan Qi Dakai Zhu Hakan Aydin Published online: 24 March

More information

1. Consider the following page reference string: 1, 2, 3, 4, 2, 1, 5, 6, 2, 1, 2, 3, 7, 6, 3, 2, 1, 2, 3, 6.

1. Consider the following page reference string: 1, 2, 3, 4, 2, 1, 5, 6, 2, 1, 2, 3, 7, 6, 3, 2, 1, 2, 3, 6. 1. Consider the following page reference string: 1, 2, 3, 4, 2, 1, 5, 6, 2, 1, 2, 3, 7, 6, 3, 2, 1, 2, 3, 6. What will be the ratio of page faults for the following replacement algorithms - FIFO replacement

More information

Overview. Sporadic tasks. Recall. Aperiodic tasks. Real-time Systems D0003E 2/26/2009. Loosening D = T. Aperiodic tasks. Response-time analysis

Overview. Sporadic tasks. Recall. Aperiodic tasks. Real-time Systems D0003E 2/26/2009. Loosening D = T. Aperiodic tasks. Response-time analysis Overview Real-time Systems D0003E Lecture 11: Priority inversion Burns/Wellings ch. 13 (except 13.12) Aperiodic tasks Response time analysis Blocking Priority inversion Priority inheritance Priority ceiling

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Tasks. Task Implementation and management

Tasks. Task Implementation and management Tasks Task Implementation and management Tasks Vocab Absolute time - real world time Relative time - time referenced to some event Interval - any slice of time characterized by start & end times Duration

More information

Concurrent activities in daily life. Real world exposed programs. Scheduling of programs. Tasks in engine system. Engine system

Concurrent activities in daily life. Real world exposed programs. Scheduling of programs. Tasks in engine system. Engine system Real world exposed programs Programs written to interact with the real world, outside the computer Programs handle input and output of data in pace matching the real world processes Necessitates ability

More information

Computing intersections in a set of line segments: the Bentley-Ottmann algorithm

Computing intersections in a set of line segments: the Bentley-Ottmann algorithm Computing intersections in a set of line segments: the Bentley-Ottmann algorithm Michiel Smid October 14, 2003 1 Introduction In these notes, we introduce a powerful technique for solving geometric problems.

More information

(b) External fragmentation can happen in a virtual memory paging system.

(b) External fragmentation can happen in a virtual memory paging system. Alexandria University Faculty of Engineering Electrical Engineering - Communications Spring 2015 Final Exam CS333: Operating Systems Wednesday, June 17, 2015 Allowed Time: 3 Hours Maximum: 75 points Note:

More information

Supporting Nested Locking in Multiprocessor Real-Time Systems

Supporting Nested Locking in Multiprocessor Real-Time Systems Supporting Nested Locking in Multiprocessor Real-Time Systems Bryan C. Ward and James H. Anderson Department of Computer Science, University of North Carolina at Chapel Hill Abstract This paper presents

More information

Analysis and Implementation of Global Preemptive Fixed-Priority Scheduling with Dynamic Cache Allocation

Analysis and Implementation of Global Preemptive Fixed-Priority Scheduling with Dynamic Cache Allocation University of Pennsylvania ScholarlyCommons Departmental Papers (CIS) Department of Computer & Information Science 4-2016 Analysis and Implementation of Global Preemptive Fixed-Priority Scheduling with

More information

A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY

A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY KARL L. STRATOS Abstract. The conventional method of describing a graph as a pair (V, E), where V and E repectively denote the sets of vertices and edges,

More information

Frequently asked questions from the previous class survey

Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [CPU SCHEDULING] Shrideep Pallickara Computer Science Colorado State University L15.1 Frequently asked questions from the previous class survey Could we record burst times in

More information

UNIT 4 DEADLOCKS 4.0 INTRODUCTION

UNIT 4 DEADLOCKS 4.0 INTRODUCTION UNIT 4 DEADLOCKS Deadlocks Structure Page Nos 4.0 Introduction 69 4.1 Objectives 70 4.2 Deadlocks 70 4.3 Characterisation of a Deadlock 71 4.3.1 Mutual Exclusion Condition 4.3.2 Hold and Wait Condition

More information

Real-Time Architectures 2004/2005

Real-Time Architectures 2004/2005 Real-Time Architectures 2004/2005 Scheduling Analysis I Introduction & Basic scheduling analysis Reinder J. Bril 08-04-2005 1 Overview Algorithm and problem classes Simple, periodic taskset problem statement

More information

Stack Size Minimization for Embedded Real-Time Systems-on-a-Chip *

Stack Size Minimization for Embedded Real-Time Systems-on-a-Chip * Design Automation for Embedded Systems, 7, 53^87, 2002. ß 2002 Kluwer Academic Publishers, Boston. Manufactured inthe Netherlands. Stack Size Minimization for Embedded Real-Time Systems-on-a-Chip * PAOLO

More information

Resource Access Protocols

Resource Access Protocols The Blocking Problem Resource Access Protocols Now we remove the assumption that tasks are independent, sharing resources, t interacting The main problem: priority inversion igh priority tasks are blocked

More information

Replica-Request Priority Donation: A Real-Time Progress Mechanism for Global Locking Protocols

Replica-Request Priority Donation: A Real-Time Progress Mechanism for Global Locking Protocols Replica-Request Priority Donation: A Real-Time Progress Mechanism for Global Locking Protocols Bryan C. Ward, Glenn A. Elliott, and James H. Anderson Department of Computer Science, University of North

More information

A Server-based Approach for Predictable GPU Access with Improved Analysis

A Server-based Approach for Predictable GPU Access with Improved Analysis A Server-based Approach for Predictable GPU Access with Improved Analysis Hyoseung Kim 1, Pratyush Patel 2, Shige Wang 3, and Ragunathan (Raj) Rajkumar 4 1 University of California, Riverside, hyoseung@ucr.edu

More information

Start of Lecture: February 10, Chapter 6: Scheduling

Start of Lecture: February 10, Chapter 6: Scheduling Start of Lecture: February 10, 2014 1 Reminders Exercise 2 due this Wednesday before class Any questions or comments? 2 Scheduling so far First-Come-First Serve FIFO scheduling in queue without preempting

More information

CPU Scheduling. Daniel Mosse. (Most slides are from Sherif Khattab and Silberschatz, Galvin and Gagne 2013)

CPU Scheduling. Daniel Mosse. (Most slides are from Sherif Khattab and Silberschatz, Galvin and Gagne 2013) CPU Scheduling Daniel Mosse (Most slides are from Sherif Khattab and Silberschatz, Galvin and Gagne 2013) Basic Concepts Maximum CPU utilization obtained with multiprogramming CPU I/O Burst Cycle Process

More information

Byzantine Consensus in Directed Graphs

Byzantine Consensus in Directed Graphs Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory

More information

Real-Time (Paradigms) (47)

Real-Time (Paradigms) (47) Real-Time (Paradigms) (47) Memory: Memory Access Protocols Tasks competing for exclusive memory access (critical sections, semaphores) become interdependent, a common phenomenon especially in distributed

More information

Multimedia Systems 2011/2012

Multimedia Systems 2011/2012 Multimedia Systems 2011/2012 System Architecture Prof. Dr. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de Sitemap 2 Hardware

More information

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition Chapter 6: CPU Scheduling Silberschatz, Galvin and Gagne 2013 Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Thread Scheduling Multiple-Processor Scheduling Real-Time

More information

Implementing Sporadic Servers in Ada

Implementing Sporadic Servers in Ada Technical Report CMU/SEI-90-TR-6 ESD-90-TR-207 Implementing Sporadic Servers in Ada Brinkley Sprunt Lui Sha May 1990 Technical Report CMU/SEI-90-TR-6 ESD-90-TR-207 May 1990 Implementing Sporadic Servers

More information

CLOCK DRIVEN SCHEDULING

CLOCK DRIVEN SCHEDULING CHAPTER 4 By Radu Muresan University of Guelph Page 1 ENGG4420 CHAPTER 4 LECTURE 2 and 3 November 04 09 7:51 PM CLOCK DRIVEN SCHEDULING Clock driven schedulers make their scheduling decisions regarding

More information

Introduction to Embedded Systems

Introduction to Embedded Systems Introduction to Embedded Systems Edward A. Lee & Sanjit Seshia UC Berkeley EECS Spring 008 Copyright 008, Edward A. Lee & Sanjit Seshia, All rights reserved Lecture 0: Scheduling Anomalies Source This

More information

Introduction. Real Time Systems. Flies in a bottle. Real-time kernel

Introduction. Real Time Systems. Flies in a bottle. Real-time kernel Introduction eal Time Systems A thread is a light-weight (cheap) process which has low start-up costs, low context switch costs and is intended to come and go easily. Threads are schedulable activities

More information

Applied Algorithm Design Lecture 3

Applied Algorithm Design Lecture 3 Applied Algorithm Design Lecture 3 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 3 1 / 75 PART I : GREEDY ALGORITHMS Pietro Michiardi (Eurecom) Applied Algorithm

More information

arxiv: v2 [cs.ds] 18 May 2015

arxiv: v2 [cs.ds] 18 May 2015 Optimal Shuffle Code with Permutation Instructions Sebastian Buchwald, Manuel Mohr, and Ignaz Rutter Karlsruhe Institute of Technology {sebastian.buchwald, manuel.mohr, rutter}@kit.edu arxiv:1504.07073v2

More information

Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5

Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5 Concurrency, Mutual Exclusion and Synchronization C H A P T E R 5 Multiple Processes OS design is concerned with the management of processes and threads: Multiprogramming Multiprocessing Distributed processing

More information