Provably Efficient Non-Preemptive Task Scheduling with Cilk

V.-Y. Vee and W.-J. Hsu
School of Applied Science, Nanyang Technological University
Nanyang Avenue, Singapore

Abstract

We consider the problem of scheduling static task graphs by using Cilk, a C-based runtime system for multithreaded parallel programming. We assume no pre-emption of task execution and no prior knowledge of the task execution times. Given a task graph G, the output of the scheduling algorithm is a Cilk program P which, when executed, initiates the tasks in a manner consistent with the precedence requirements of G. We show that the Cilk model has restrictions in implementing optimal schedules for certain types of task graphs; however, the restriction does not fundamentally hinder the practical applications of Cilk, as it is possible to produce schedules of reasonably good quality (in the sense of expected execution time). Our algorithm identifies a minimal number of stages, assigns tasks to these stages, and bundles parallel tasks of the same stage into one Cilk procedure. By using Tarjan's algorithm (for set operations) to implement the bundling process, we demonstrate that the parallel schedule can be derived in O(n+e) time for all practical purposes, where n and e denote the number of nodes and edges in the task graph G. With P processors, the expected completion time for the scheduled tasks is bounded by $T_P = O(T_1/P + S)$, where $T_1$ denotes the total work, i.e., the time required for executing all tasks on a single processor, and S denotes the sum (over all stages) of the longest execution time of the tasks at each stage. When the execution times of tasks are relatively homogeneous, the quality of the schedule generated by using our approach is nearly optimal.

Index Terms: Scheduling, Cilk.

1 Introduction

Task scheduling is one of the most challenging problems in parallel and distributed computing [1].
The scheduling of an arbitrary task graph on an arbitrary number of processors is NP-complete [1,6]. We introduce a sub-optimal approach to task scheduling based on Cilk, a provably efficient runtime system for multithreaded programming [2,4].

A set of tasks T is specified as a strict (irreflexive) partially ordered set (T, <). The relation u < v denotes that the computation of task v depends on the results of the computation of task u, i.e., < specifies precedence constraints. If u < v, u is said to be a predecessor of v, and v a successor of u. The partial order < is conveniently represented as a directed acyclic graph called a task graph. A directed edge (i, j) from task $T_i$ to $T_j$ specifies that $T_i < T_j$. Two tasks $T_i$ and $T_j$ are comparable if either $T_i < T_j$ or $T_j < T_i$ holds (by transitivity); otherwise, they are incomparable. A feasible schedule must preserve all precedence relations. A schedule is efficient if it minimizes the total execution time. Given a task graph, the classic task-scheduling problem is to find a feasible and efficient schedule for the assignment of the tasks to the processors.

We present an efficient scheduling algorithm for implementing arbitrary task graphs by using Cilk. Given a task graph G, the output of the scheduling algorithm is a Cilk program P which, when executed, initiates the tasks in a manner consistent with the precedence requirements of G.

1.2 The Cilk Model

A Cilk multithreaded computation is viewed as a series-parallel dag that unfolds dynamically as the computation progresses. Threads are embedded in a tree of Cilk procedures. Each thread is a nonblocking C function. A thread can spawn other Cilk threads that begin new child procedures. The Cilk runtime system uses a provably efficient scheduler based on randomized work stealing [5]. Because of this feature, the schedule of task graphs generated by our approach is also nondeterministic. Nevertheless, the invocation sequence of the tasks is guaranteed to be consistent with the given task graph.

The Cilk runtime system provides a performance model based on two parameters: work and critical path length. The work $T_1$ is the total time required to execute the program on one processor. The critical path length $T_\infty$ is the time required for the threads along the longest dependency path in the dag. It is shown theoretically [2] that Cilk's work-stealing scheduler executes a Cilk computation on P processors in time $T_P = O(T_1/P + T_\infty)$, which is asymptotically optimal. It has also been verified empirically that the constant factor hidden by the order notation is small, i.e., $T_1/P + T_\infty$ is a good approximation to the execution time on P processors [2].

Another crucial property of the Cilk model is the dag-consistent distributed shared memory model [3,7].
It is a lock-free consistency model suitable for multithreaded programming environments. The idea is that each thread sees values that are consistent with some serial execution order of the dag, but two different threads may see different serial orders. Thus, the writes performed by a thread are seen by its successors, but threads that are incomparable in the dag may or may not see each other's writes. However, Theorem 1.1 shows that the dag-consistency model imposes a basic restriction in implementing an optimal schedule for general task graphs.

Theorem 1.1 There exist task graphs for which an optimal schedule cannot be implemented by using Cilk procedures.

Proof. (Omitted. Cf. [11])

Theorem 1.1 holds because Cilk is designed for tree-like computations rather than general computations. Given this, the best we can attempt is to implement a sub-optimal task scheduler with Cilk.

2. The Scheduling Algorithm

Our algorithm consists of two phases. In the first phase, we apply the bundling algorithm, a dag-to-tree transformation, to convert a given task graph into a tree-structured bundle tree. In the second phase, we map the bundle tree to the Cilk model of multithreaded computation. The main ideas of the algorithms are explained here; for the details, see [11].

2.1 Bundling Algorithm

We apply the bundling algorithm to perform the dag-to-tree transformation. First, partition the nodes of the task graph into subsets called bundles. All of the tasks that belong to the same bundle must be incomparable, so that they can be executed in parallel. Finally, if we treat each bundle as a node of a new graph G', and add an edge from bundle $B_i$ to bundle $B_j$ whenever some task in $B_i$ is a predecessor of a task in $B_j$, the graph G' is tree-structured. With the bundle tree, the tasks can then be scheduled in Cilk easily.

Bottom-up Bundling

Assign each task in the task graph a stage such that the tasks in the same stage are incomparable to each other. The stage number of a task node is defined to be one greater than the maximum stage number among its successors; it can be determined by a breadth-first topological sort [10, 11]. In Bottom-up Bundling, we start with the tasks in the last stage and proceed towards the earliest stage in a stage-by-stage fashion. As the bundling progresses, the bundles form several trees (referred to as partial bundle trees). Finally, they are linked together to form one final bundle tree.
At each iteration, visit a task node t that has no successor or whose successors have all been visited. Create a new bundle $B_n$ containing t (MAKESET(t)). Then check all the bundles that embed its immediate successors x (FINDSET(x)) to determine the relation between $B_n$ and these bundles. For a bundle that embeds a successor x, find the root bundle $B_r$ of the partial bundle tree that contains x (FINDROOT(x)). If $B_n \neq B_r$ and $B_n$ and $B_r$ belong to the same stage, we merge them (UNION($B_n$, $B_r$)); otherwise ($B_n$ then has the higher stage number), we link them and make $B_n$ the new root bundle.

To perform the merging of bundles efficiently, we regard the bundles as disjoint sets and apply Tarjan's algorithm to handle the disjoint-set operations FINDSET, MAKESET and UNION [8,9]. Specifically, by regarding the bundles that share a root bundle as a set, the same technique for maintaining disjoint sets can be applied. To determine the root bundle of any given bundle, we keep a direct reference (pointer) to the root bundle, and apply the path compression technique [9] to update this pointer after every root query (FINDROOT) operation.

That the bundling produces a tree of bundles should be clear from the algorithm. Note that any task node is queried by all its immediate predecessors. If any two immediate predecessors are in the same stage, they are merged. If they are not in the same stage, we make the one in the earlier stage a predecessor of the one in the later stage, while eliminating the direct dependency between the former predecessor and the node being queried. This guarantees that every bundle has at most one parent, which in turn shows that a tree-structured bundle tree is formed. It is also straightforward to verify (inductively) that the operational precedence specified by the bundle tree does not conflict with that of the input dag. Clearly, the tasks included in the root bundle are to be executed first. The child bundles are the successors of a given bundle. The directed edges reflect the operational precedence among bundles. The dag-to-tree conversion is, therefore, correct.
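The bottom-up bundling described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [11]: the task graph is given as a successor dictionary (`succ`, a hypothetical name), tasks without successors receive stage 0, stages are computed by memoized recursion instead of the breadth-first topological sort mentioned in the text, and tasks are ordered with a plain sort, so the sketch does not itself achieve the near-linear bound. Two union-find structures stand in for FINDSET/UNION and FINDROOT. The final linking of several partial bundle trees into one tree is not shown, so the function returns a list of roots.

```python
from collections import defaultdict


class DisjointSets:
    """Union-find with path compression and union by size."""

    def __init__(self):
        self.parent, self.size = {}, {}

    def make_set(self, x):
        self.parent[x], self.size[x] = x, 1

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        a, b = self.find(a), self.find(b)
        if a == b:
            return a
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        return a


def assign_stages(succ):
    """Stage 0 for tasks without successors (assumption), else 1 + max over successors."""
    stage = {}

    def visit(t):
        if t not in stage:
            stage[t] = 1 + max((visit(s) for s in succ[t]), default=-1)
        return stage[t]

    for t in succ:
        visit(t)
    return stage


def bundle_tree(succ):
    """Return (roots, members, children) of the bundle forest; names are illustrative."""
    stage = assign_stages(succ)
    bundles = DisjointSets()     # task -> bundle (MAKESET, FINDSET, UNION)
    trees = DisjointSets()       # task -> partial bundle tree (FINDROOT)
    root_of = {}                 # tree representative -> a task in its root bundle
    children = defaultdict(set)  # bundle representative -> child bundle representatives

    for t in sorted(succ, key=stage.get):  # last-executed stage first
        bundles.make_set(t)
        trees.make_set(t)
        root_of[t] = t
        bn = t
        for x in succ[t]:
            br = bundles.find(root_of[trees.find(x)])  # FINDROOT(x)
            if br == bn:
                continue
            if stage[br] == stage[t]:  # same stage: merge the two root bundles
                kids = children.pop(bn, set()) | children.pop(br, set())
                bn = bundles.union(bn, br)
                children[bn] = kids
            else:                      # different stage: bn becomes the new root
                children[bn].add(br)
            root_of[trees.union(t, x)] = bn

    members = defaultdict(set)
    for t in succ:
        members[bundles.find(t)].add(t)
    roots = sorted({bundles.find(root_of[trees.find(t)]) for t in succ})
    return roots, dict(members), children


# Example graph (hypothetical): a -> c, b -> c, b -> d, c -> e, d -> e.
succ = {"a": ["c"], "b": ["c", "d"], "c": ["e"], "d": ["e"], "e": []}
roots, members, children = bundle_tree(succ)
```

For this example graph the sketch produces a single chain of bundles: the root bundle {a, b}, its child {c, d}, and the leaf {e}. Tasks a and b are merged because both query c and share a stage; c and d are merged for the same reason with respect to e.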
2.2 Generating Cilk procedures

The key idea is that, when a bundle becomes ready (i.e., when all predecessors of the tasks in the bundle are completed), we spawn a Cilk procedure to execute the tasks included in the bundle. When all tasks in the bundle have been executed, we spawn a Cilk procedure to handle each of its child bundles.

3 Performance Analysis

Theorem 3.1 Let n denote the number of nodes and e the number of edges in a given task graph G. The time complexity for producing a parallel schedule is no more than $O((n+e)(1+\alpha(n, n+e)))$, where $\alpha(x, y)$ denotes the inverse of Ackermann's function [9].

Proof. (Omitted. Cf. [11])
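To make the quantities used in the rest of this section concrete, the following Python sketch computes the work $T_1$, the critical path length $T_\infty$, and the stage sum S (as defined in the abstract) for a small task graph. The graph, the cost tables, and all function names are hypothetical and chosen only for illustration; stages are numbered from 0 at the tasks without successors, which is an assumption of the sketch.

```python
def schedule_metrics(succ, cost):
    """Return (T_1, T_inf, S) for a task graph given as a successor dictionary."""
    stage, path = {}, {}

    def visit(t):
        # stage[t]: 1 + max stage over successors (0 for tasks without successors)
        # path[t]:  cost of the longest dependency path starting at t
        if t not in stage:
            below = [visit(s) for s in succ[t]]
            stage[t] = 1 + max((s for s, _ in below), default=-1)
            path[t] = cost[t] + max((p for _, p in below), default=0)
        return stage[t], path[t]

    for t in succ:
        visit(t)

    work = sum(cost[t] for t in succ)      # T_1
    span = max(path.values())              # T_inf
    longest = {}                           # stage -> longest task in that stage
    for t, s in stage.items():
        longest[s] = max(longest.get(s, 0), cost[t])
    return work, span, sum(longest.values())  # S is the sum over stages


succ = {"a": ["c"], "b": ["c", "d"], "c": ["e"], "d": ["e"], "e": []}
uniform = {t: 1 for t in succ}
mixed = {"a": 4, "b": 1, "c": 1, "d": 5, "e": 1}

print(schedule_metrics(succ, uniform))  # (5, 3, 3)
print(schedule_metrics(succ, mixed))    # (12, 7, 10)
```

With unit costs, S equals $T_\infty$, whereas with the mixed costs S exceeds $T_\infty$ by 3, because the longest task of each stage need not lie on a single dependency path.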

The following bound applies to the total execution time of the schedule.

Lemma 3.2 For any task graph with work $T_1$ and critical path length $T_\infty$, and for any number P of processors, any greedy P-processor execution schedule achieves $T_P \le T_1/P + T_\infty$.

Proof. (Omitted. Cf. [11])

Notice that the bundling algorithm groups the tasks into several bundles, each of which is spawned off and executed as a batch. Because of this, there is no way to schedule a ready task that belongs to an unready bundle, even if idle processors are available. A ready task is scheduled if and only if the bundle embedding it becomes ready and an idle processor is available. Clearly, if all the tasks in the same stage have been executed, all the tasks on the next lower stage level become ready to be scheduled. The reason is simple: under this condition, the predecessors of all the tasks in the next lower stage level have been executed. In the extreme case where all the tasks in the same stage are grouped into the same bundle, the above statement still holds. This leads to the theorem below.

Theorem 3.3 With P processors, the expected completion time for a set of tasks embedded in a bundle tree is bounded by $T_P = O(T_1/P + S)$, where $T_1$ denotes the work, and S denotes the summation (over all stages) of the longest execution time of the tasks at each stage.

Proof. (Omitted. Cf. [11])

From Lemma 3.2 and Theorem 3.3, the difference between the time bounds is $(S - T_\infty)$. For a given task graph, if all the tasks have identical execution time, we have $(S - T_\infty) = 0$. This means that the performance of the bundle tree is optimal if all tasks have identical execution time.

4 Discussion and Conclusion

We have presented a provably efficient approach for static task scheduling based on the Cilk model. It has been shown theoretically that the quality of the schedule generated by using our approach is nearly optimal in certain cases. We have used this approach to successfully implement a parallel Make facility on the IBM SP2 machine and obtained some preliminary results. Our work represents one of the serious attempts to use Cilk to handle nontrivial types of concurrency (as required in the task scheduling problem), and the positive results have not only proven that the obvious is wrong, but also encouraged us to take the new tool seriously in future research.

Acknowledgment

We thank Charles Leiserson for his generous provision of the Cilk system and the fun of doing research using the great system.

References

[1] Hesham El-Rewini, Theodore G. Lewis, Hesham H. Ali. "Task Scheduling in Parallel and Distributed Systems," Prentice Hall.
[2] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. "Cilk: An efficient multithreaded runtime system." In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Santa Barbara, California.
[3] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. "Dag-consistent distributed shared memory." In Proceedings of the 10th International Parallel Processing Symposium, Honolulu, Hawaii.
[4] Robert D. Blumofe. "Executing Multithreaded Programs Efficiently." Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
[5] Robert D. Blumofe and Charles E. Leiserson. "Scheduling multithreaded computations by work stealing." In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, Santa Fe, New Mexico.
[6] Michael R. Garey, David S. Johnson. "Computers and Intractability: A Guide to the Theory of NP-Completeness," W. H. Freeman and Company, New York.
[7] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. "An analysis of dag-consistent distributed shared-memory algorithms." In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Padua, Italy.
[8] Robert E. Tarjan. "On the efficiency of a good but not linear set merging algorithm," J. ACM 22:2.
[9] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest. "Introduction to Algorithms," MIT Press.
[10] Robert E. Tarjan. "Data Structures and Network Algorithms," SIAM, Philadelphia.
[11] Vee Voon Yee. "Provably Efficient Task Scheduling by Using Cilk," HYP, NTU-SAS.
