Algorithms Design for the Parallelization of Nested Loops


NATIONAL TECHNICAL UNIVERSITY OF ATHENS
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING
DEPARTMENT OF INFORMATICS AND COMPUTER TECHNOLOGY
COMPUTING SYSTEMS LABORATORY

Algorithms Design for the Parallelization of Nested Loops

INTERMEDIATE PHD PROPOSAL

Florina-Monica Ciorba

Athens, 16 October 2006


NATIONAL TECHNICAL UNIVERSITY OF ATHENS
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING
DEPARTMENT OF INFORMATICS AND COMPUTER TECHNOLOGY
COMPUTING SYSTEMS LABORATORY

Algorithms Design for the Parallelization of Nested Loops

INTERMEDIATE PHD PROPOSAL
of
Florina-Monica Ciorba
Computer Engineer, University of Oradea, Romania (2001)

3-Member Advisory Committee: G. Papakonstantinou, P. Tsanakas, N. Koziris

Approved by the 5-Member Examining Committee:
G. Papakonstantinou, Professor NTUA
P. Tsanakas, Professor NTUA
N. Koziris, Assistant Professor NTUA
A. G. Stafilopatis, Professor NTUA
P. Constantinou, Professor NTUA

Athens, 16 October 2006


Abstract

The purpose of this work is finding efficient algorithms and methods for the parallelization of nested loops. There are two approaches to the above problem: the software approach, which means parallelizing whole programs containing nested loops with the use of computer clusters, and the hardware-software approach, in which a program is partitioned such that the part containing the nested loops is mapped to hardware (usually reconfigurable FPGAs), whereas the rest of the program is executed in software on a general purpose CPU. With respect to the software approach, two types of algorithms were designed: static scheduling and dynamic scheduling algorithms. At first, an algorithm for determining the minimum number of processors required for a static schedule was designed [AKC+03b]. Soon afterward, another static algorithm [AKC+03a] was designed for the efficient assignment of computations onto the minimum number of processing elements, guaranteeing the optimal makespan. Subsequent efforts resulted in an adaptive cyclic scheduling method [CAP05] that exploits the geometric properties of the iteration space in order to arrive at an efficient geometric decomposition of that space. This algorithm is also considered a static scheduling one. A similar static scheduling algorithm, which also exploits the regularity of nested loop iteration spaces, was designed and named chain pattern scheduling [CAD+05]. Later on, efforts were made towards automatically producing the equivalent parallel code from a given sequential program with nested loops, for shared and distributed memory architectures [CAK+03, ACT+04]. The scheduling algorithms used were dynamic, based on the hyperplane method and on computational geometry methods. A new scheme was devised, called dynamic multi-phase scheduling [CAR+05, PRA+06], as a first effort towards scheduling nested loops using a coarse-grained approach on heterogeneous clusters. Work is underway to apply the chain pattern scheduling algorithm in a hardware-software approach as well. Future work will focus on the application of more or all of the above algorithms and methodologies to the hardware-software approach.

Keywords: nested loops, static scheduling, dynamic scheduling, shared memory systems, distributed memory systems, homogeneous clusters, heterogeneous clusters.


Contents

List of Figures
List of Tables

1  Introduction
   1.1  The problem of task scheduling
   1.2  Overview of the bibliography
   1.3  Publications
   1.4  Acknowledgement

2  Background
   2.1  Algorithmic model
   2.2  Iteration space & precedence constraints

3  Static scheduling algorithms
   3.1  Scheduling using a binary search decision algorithm
        3.1.1  Computing the ECT and LCT subsets
        3.1.2  The binary scheduling decision algorithm
        3.1.3  Complexity of the BSDA algorithm
        3.1.4  Experimental results
   3.2  Dynamic lower bound on the number of processors
        3.2.1  Computing the lower and upper bound of processors
        3.2.2  Implementation and experimental results
   3.3  Adaptive cyclic scheduling
        3.3.1  Adaptive cyclic scheduling description
        3.3.2  ACS on homogeneous systems
        3.3.3  ACS on heterogeneous systems
   3.4  Chain pattern scheduling
        3.4.1  Chain pattern scheduling
        3.4.2  Implementation and performance results

4  Dynamic scheduling algorithms
   4.1  Scaled GRIDs and parallel code generation
        4.1.1  Scaled GRIDs
        4.1.2  Dynamic scheduling policy
        4.1.3  Parallel code generation
        4.1.4  Implementation and experimental results
   4.2  Successive dynamic scheduling and parallel code generation for general loops
        4.2.1  Finding the optimal scheduling hyperplane using convex hulls
        4.2.2  Lexicographic ordering on hyperplanes
        4.2.3  Overview of Cronus/1
        4.2.4  Implementation and experimental results
   4.3  Dynamic multi-phase scheduling for heterogeneous clusters
        4.3.1  Motivation for dynamic scheduling of loops with dependencies
        4.3.2  Dynamic scheduling for loops with dependencies
        4.3.3  The DMPS(x) algorithm
        4.3.4  Implementation and test results
        4.3.5  Interpretation of the results

5  Conclusions
   5.1  Future directions

Bibliography

List of Figures

2.1  Algorithmic model
     The geometrical representation of a 2D Cartesian iteration index space with 5 uniform precedence constraints
     The geometrical representation of the ECT, LCT and CR sets for Example
     The results of scheduling 2D, 3D, 4D, 5D UDLs with ONP in [LB, UB]
     The ECT sets for Example
     The results of scheduling 2D UDLs: left - without the LB 3, right - with the LB
     The results of scheduling 3D UDLs: left - without the LB 3, right - with the LB
     The results of scheduling 4D UDLs: left - without the LB 3, right - with the LB
     The results of scheduling 5D UDLs: left - without the LB 3, right - with the LB
     Geometrical representation of the index space, the regions, the cones, the cone vectors, the communication vector, some chains and the optimal hyperplane of Example
     ACS chain assignment on a homogeneous system with 5 slaves
     The patterns, pattern vectors and chains of a general UDL with 4 dependence vectors
     Scenario (1) Unbounded NP, high communication, moderate performance: every chain is mapped to a different processor (unrealistic assumption with respect to the number of available processors)
     Scenario (2) Fixed NP, moderate communication (cyclic mapping): chains are mapped to the available processors in a cyclic fashion starting with chain C
     Scenario (2) Fixed NP, moderate communication (mapping with offset 3 along the x axis, and offset 2 along the y axis): chains are mapped to the available processors so as to minimize communication imposed by d_c, d_1 and d
     Experimental results, CPS vs cyclic, on NP =
     Experimental results, CPS vs cyclic, on NP =
     Experimental results, CPS vs cyclic, on NP =
     Experimental results, CPS vs cyclic, on NP =
     Geometrical representation of the (a) iteration index space J and (b) auxiliary index space J_aux of Example
4.2  Speedup results for 7 testing examples
     Optimal hyperplane for two different index spaces, same dependence vectors, different terminal points
     Minimum and maximum points on hyperplanes
     Recursive function for computing min_{k,n+1} of a hyperplane
     Function for computing the successor on any hyperplane
     The organization of Cronus/1
     Six-level nested loop full search block matching motion estimation
     Speedup for FSBM, with frame size 1024 x
     Speedup for FSBM, with frame size 1280 x
     Speedup for FSBM, with frame size 600 x
     Speedup on random examples with 600 x 600 index space size
     Speedup on random examples with 800 x 800 index space size
     Synchronization points
     Chunks are formed along u_2 and SP are introduced along u
     State diagram of the slaves
     The dependence patterns for heat equation and Floyd-Steinberg
     Speedups for the heat equation on a dedicated & non-dedicated heterogeneous cluster
     Speedups for Floyd-Steinberg on a dedicated & non-dedicated heterogeneous cluster

List of Tables

3.1  Optimal schedule with 20 processors for the UDL of Example
     Execution evaluation for 2D, 3D, 4D, 5D UDLs and the percentage of binary searching ONP between LB and UB
     The SGRID examples used for testing
     Sample chunk sizes given for J = and m =
     Parallel execution times (sec) for heat equation on a dedicated & non-dedicated heterogeneous cluster
     Parallel execution times (sec) for Floyd-Steinberg on a dedicated & non-dedicated heterogeneous cluster


CHAPTER 1

Introduction

The need for parallel processing arises from the existence of time-consuming applications in different areas, such as weather forecasting, nuclear fusion simulations, DNA and protein analysis, computational fluid dynamics, etc. Parallel processing comprises algorithms, computer architecture, programming and performance analysis. The process of parallelization consists of analyzing sequential programs for parallelism and restructuring them to run efficiently on parallel systems. When optimizing the performance of scientific and engineering sequential programs, the most gain comes from optimizing nested loops or recursive procedures, where major chunks of computation are performed repeatedly. The goal of parallelization is to minimize the overall computation time by distributing the computational workload among the available processors.

Parallelization can be automatic or manual. During automatic parallelization, the appropriate part of the compiler analyzes and restructures the program with little or no intervention by the user. The compiler automatically generates code that splits the processing of loops among multiple processors. An alternative is manual parallelization, in which the user performs the parallelization using compiler directives and other programming techniques.

One of the complex challenges in parallel processing is efficiently mapping the computational workload to the parallel system, taking into account all characteristics of the particular architecture. A key issue that must be carefully considered is minimizing the communication cost, in order to increase the performance of the parallel program. Communication may be required for every iteration of the loop (fine-grain parallelism approach) or for groups of iterations (coarse-grain parallelism approach). The iterations within a loop nest can be either independent iterations or precedence constrained iterations. The latter can be uniform (constant) or non-uniform throughout the execution of the program. This work examines the uniform precedence constraints case.

When parallelizing loop structures for parallel computers, there are several tasks that need to be done:

- Detection of the inherent parallelism, by applying (when necessary) any program transformation that may enable this.

- Computation scheduling (specifying when the different computations are performed).

- Computation mapping (specifying where the different computations are performed).

- Explicitly managing the memory and communication (an additional task for distributed memory systems).

- Generating the parallel code, so that each processor will execute its apportioned computation and communication explicitly.

The goal of parallel processing is to schedule tasks among the available processors to achieve some performance goal(s), such as minimizing communication costs and execution time, and/or maximizing resource utilization. In most cases, the term scheduling refers to the computation scheduling and mapping. Local scheduling, performed by the operating system of a processor, consists of the assignment of processes to the time-slices of the processor. Global scheduling is the process of deciding where to execute a process in a multiprocessor system. It may be carried out by a single authority or it may be distributed among processing elements. In this work we focus on global scheduling, and we will refer to it simply as scheduling.

There are two basic scheduling methods: static and dynamic. In static scheduling, the assignment of the processors is done before the program execution begins. Information regarding the task execution times and resources is assumed to be known at compile time. These methods are processor non-preemptive. The approach here is to minimize the execution time of the concurrent program while minimizing the communication delay. These methods aim at:

- Predicting the program execution behavior at compile time.

- Performing a partitioning of smaller tasks into coarser grain processes to reduce communication delays.

- Allocating processes to processors.

Perhaps one of the most critical drawbacks of this approach is that generating optimal schedules is an NP-complete problem, hence only restricted solutions can be given. Heuristic methods rely on rules of thumb to guide the scheduling process in the proper track for a near optimal solution. Static scheduling algorithms are in general easier to design and program.

Dynamic scheduling is based on the redistribution of tasks at execution time. This is achieved either by transferring tasks from heavily loaded processors to lightly loaded ones (called load balancing), or by adapting the assigned number of tasks to match the workload variation in the multiprocessor system. Dynamic scheduling is necessary in situations where static scheduling may result in a highly imbalanced distribution of work among processors, or where the task-dependency graph itself is dynamic, thus precluding a static schedule. Since the primary reason for using a dynamic scheduling algorithm is balancing the workload among processors, dynamic algorithms are often referred to as dynamic load-balancing algorithms. Dynamic scheduling algorithms are usually more complicated, particularly in the message-passing programming paradigm.

Dynamic scheduling algorithms can be either centralized or distributed (de-centralized). In a centralized dynamic load balancing approach, all ready-to-be-executed tasks are maintained in a common central data structure, or they are maintained by a special processor or a subset of processors. If a special processor is designated to manage the pool of available tasks, then it is often referred to as the master, and the other processors that depend on the master to obtain work are referred to as slaves.

Whenever a processor has no work, it takes a portion of the available work from the central data structure or the master processor. Whenever a new task is generated, it is added to this centralized data structure or reported to the master processor. This describes the master-slave model. Centralized load-balancing schemes are usually easier to implement than distributed schemes, but may have limited scalability. As more and more processors are used, the large number of accesses to the common data structure or to the master processor tends to become a bottleneck.

In a distributed dynamic load balancing approach, the set of ready-to-be-executed tasks is distributed among processors, which exchange tasks at run time to balance the work. Each processor can send work to or receive work from any other processor. These schemes do not suffer from the bottleneck associated with the centralized schemes. However, it is more difficult to pair up sending and receiving processors for an exchange, to decide whether the transfer is initiated by the sender (an overburdened processor) or the receiver (a lightly loaded processor), to decide how much work is transferred in each exchange, and to decide how often a work transfer should be performed. The load balancing operations may be centralized in a single processor or distributed among all processing elements. Combined policies are also possible, in which the information policy is centralized but the transfer and placement policies are distributed. In a distributed environment, each processing element has its own image of the system load. We can have a cooperative scheduling arrangement, where a gradient distribution of the load is performed, or, in a non-cooperative setup, we can use random scheduling, which is especially useful when the load of all the processing elements is high. The main advantage of the dynamic approach over the static approach is the inherent flexibility allowing for adaptation to unforeseen application requirements at run-time, though at the cost of communication overheads. The disadvantages are mainly due to communication delays, load information transfer and decision making overheads.

To achieve performance on parallel systems, programs must take into account both parallelism and data locality. There are two main classes of memory architectures for parallel machines: distributed address space machines and shared address space machines. The distributed address space machines are built from a group of processors connected via a communication network. Each processor has its own local memory that it can access directly; messages have to be sent across the network for a processor to access data stored in another processor's memory. The most popular example is a cluster of workstations (COW). In contrast, the shared address space machines present the programmer with a single memory space that all processors can access. Many of these machines support hardware cache coherence to keep data consistent across the processors. Shared address space machines can either have a single shared memory that can be uniformly accessed by all the processors, or the memories may be physically distributed across the processors. The former class of machines are often called centralized shared address space machines, while the latter are called distributed shared address space machines. Distributing the memories across processors removes the need for a single shared bus going from all the processors' caches to the memory, and thus makes it easier to scale to more processors. Examples of distributed shared address space systems are the Silicon Graphics Origin commercial machines.
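The centralized master-slave scheme described above is the model underlying the dynamic algorithms discussed later in this work, typically realized on message-passing clusters. The following is only a minimal illustrative sketch, assuming MPI, a pool of independent chunks, and placeholder tags and a placeholder work() routine; it is not code from this thesis.

    /* Minimal master-slave dynamic work distribution sketch (assumes MPI). */
    #include <mpi.h>

    #define N        1000   /* total number of chunks (assumed) */
    #define TAG_REQ  1      /* slave -> master: request for work */
    #define TAG_WORK 2      /* master -> slave: chunk index, or -1 to stop */

    static void work(int chunk) { /* placeholder for the chunk computation */ }

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                      /* master: owns the pool of chunks */
            int next = 0, stopped = 0, dummy, chunk;
            MPI_Status st;
            while (stopped < size - 1) {
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                         MPI_COMM_WORLD, &st);
                chunk = (next < N) ? next++ : -1;  /* -1 signals termination */
                if (chunk < 0) stopped++;
                MPI_Send(&chunk, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
            }
        } else {                              /* slave: request, compute, repeat */
            int dummy = 0, chunk;
            for (;;) {
                MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
                MPI_Recv(&chunk, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (chunk < 0) break;
                work(chunk);
            }
        }
        MPI_Finalize();
        return 0;
    }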
The problem of scheduling for distributed-memory systems is more challenging than for shared-memory systems, because the communication cost must be taken into account. Two things make communication in multiprocessor systems more inefficient than in uniprocessors: long latencies due to inter-processor communication, and multiprocessor-specific cache misses on machines with coherent caches.

Long memory latencies mean that the amount of inter-processor communication in the program is a critical factor for performance. Thus it is important for computations to have good temporal locality. A computation has good data locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to require less communication. It is therefore important to take communication and temporal locality into consideration when deciding how to parallelize a loop nest and how to assign the iterations to processors.

Distributed systems can be classified as homogeneous and heterogeneous. A homogeneous system is composed of identical processors, having equal processing and communication speeds. A heterogeneous system consists of non-identical processors, with either different processing speeds (processor heterogeneity) or different communication speeds (network heterogeneity), or both. Processor heterogeneity arises from having processors with a different architecture, i.e., different processor types, memory types, etc. Network heterogeneity arises when the speed and bandwidth of the communication links between different pairs of processors differ significantly. The problem of optimally distributing tasks across a system with both processor and network heterogeneity is much more difficult than across a cluster of heterogeneous processors interconnected with a homogeneous high-performance communication network. Both homogeneous and heterogeneous systems can be either dedicated, meaning that the processors are dedicated to running the program and no other loads are interposed during the execution, or non-dedicated, meaning that the system may be used by other users as well, affecting the performance of the parallel program. Both cases are examined in this work, for homogeneous as well as for heterogeneous systems.

1.1 The problem of task scheduling

In fine-grain parallel approaches, the way computations are mapped to processors is explicit in the algorithm. In other words, the algorithm explicitly specifies what each processor should do in any given computation step. With coarse-grain parallel approaches, it is possible to decouple the algorithm's specification from the assignment of tasks to processors. This is in fact required in dynamic load balancing, where the number of available processors is determined at compile time or even at run time.

The task scheduling problem [Par99]: given the set of tasks of a parallel computation, determine how the tasks can be assigned to processing resources (scheduled on them) to satisfy certain optimality criteria. The set of tasks is usually defined in the form of a directed graph, with nodes specifying computational tasks and links corresponding to data dependencies or communications. However, it can also be defined in the form of a Cartesian space, where the points of this space specify computational tasks and vectors define the precedence constraints. Optimality criteria may include minimizing execution time, maximizing resource utilization, minimizing interprocessor communication, meeting deadlines, or a combination of these factors.

Associated with each task is a set of parameters, including one or more of the following:

1. Execution or running time. We may be given the worst case, average case, or the probability distribution of a task's execution time.

2. Creation. We may be faced with a fixed set of tasks, known at compile time, or a probability distribution for the task creation times.

3. Relationship with other tasks.
This type of information may include criticality, priority order, and/or data dependencies.

4. Start or end time. A task's release time is the time before which the task should not be executed. Also, a hard or soft deadline may be associated with each task. A hard deadline is specified when the results of a task become practically worthless if not obtained by a certain time. A soft deadline may penalize late results but does not render them totally worthless.

Resources or processors on which tasks are to be scheduled are typically characterized by their ability to execute certain classes of tasks and by their performance or speed. Often uniform capabilities are assumed for all processors, either to make the scheduling problem tractable or because the parallel systems of interest do in fact consist of identical processors, called homogeneous systems. In this latter case, any task can be scheduled on any processor.

The scheduling algorithm or scheme is itself characterized by:

1. Preemption. With nonpreemptive scheduling, a task must run to completion once started, whereas with preemptive scheduling, the execution of a task may be suspended to accommodate a more critical or higher-priority task. In practice, preemption involves some overhead for storing the state of the partially completed task and for continually checking the task queue for the arrival of higher-priority tasks. However, this overhead is ignored in most scheduling algorithms for reasons of tractability.

2. Granularity. Fine-grain, medium-grain, or coarse-grain scheduling problems deal with tasks of various complexities, from simple multiply-add calculations to large, complex program segments, perhaps consisting of thousands of instructions. Fine-grain scheduling is often incorporated into the algorithm, as otherwise the overhead would be prohibitive. Medium- and coarse-grain scheduling are not fundamentally different, except that with a larger number of medium-grain tasks to be scheduled, the computational complexity of the scheduling algorithm can become an issue, especially with on-line or run-time scheduling.

Unfortunately, most interesting task scheduling problems, some with as few as two processors, are NP-complete. This fundamental difficulty has given rise to research results on many special cases that lend themselves to analytical solutions, and to a great many heuristic procedures that work fine under appropriate circumstances or with tuning of their decision parameters. Stankovic et al. [SSDNB95] present a good overview of basic scheduling results and the boundary between easy and hard problems. El-Rewini et al. [ERAL95] provide an overview of task scheduling in multiprocessors. Polynomial time optimal scheduling algorithms exist only for very limited classes of scheduling problems. Examples include scheduling tree-structured task graphs with any number of processors and scheduling arbitrary graphs of unit-time tasks on two processors [ERAL95]. Most practical scheduling problems are solved by applying heuristic algorithms.
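Purely as an illustration, the task parameters enumerated above could be collected in a record such as the following; the field names are hypothetical and are not taken from the thesis or from [Par99].

    /* Illustrative task record holding the parameters listed in Section 1.1. */
    typedef struct task {
        double exec_time;      /* (1) execution or running time (e.g., worst/average case) */
        double creation_time;  /* (2) creation time, if tasks are created dynamically      */
        int    num_deps;       /* (3) number of predecessor tasks                          */
        int   *deps;           /* (3) indices of the predecessor tasks                     */
        double release_time;   /* (4) time before which the task should not start          */
        double deadline;       /* (4) hard or soft deadline, if any                        */
    } task_t;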
1.2 Overview of the bibliography

Traditionally, software has been written for serial computation. The serial program was to be run on a single computer having a single CPU, the problem was broken into a sequence of instructions that were executed one after another, and only one instruction could be executed at any moment in time. In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem. The parallel program is to be run using multiple CPUs. The problem is broken into discrete parts that can be solved concurrently. Each part is further broken down into a series of instructions, and instructions from each part execute simultaneously on different CPUs. The compute resources can include a single computer with multiple processors, an arbitrary number of computers connected by a network, or a combination of both.

The computational problem usually demonstrates characteristics such as the ability to be broken apart into discrete pieces of work that can be solved simultaneously, to execute multiple program instructions at any moment in time, and to be solved in less time with multiple compute resources than with a single compute resource. When the discrete pieces of work are independent, no communication is required for the parallel execution. These types of problems are often called embarrassingly parallel because they are so straightforward. For example, imagine an image processing operation where every pixel in a black and white image needs to have its color reversed. The image data can easily be distributed to multiple tasks that then act independently of each other to do their portion of the work. However, most parallel applications are not quite so simple, and do require communication between different pieces of work. For example, a 3-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data. Changes to neighboring data have a direct effect on that task's data.

A dependence exists between program statements when the order of statement execution affects the results of the program. A data dependence results from multiple uses of the same storage location(s) by different tasks. Dependencies are of crucial importance to parallel programming because they are one of the primary inhibitors of parallelism. Dependencies must therefore be thoroughly analyzed in order to decide upon the best way to parallelize the application [Ban88, Pug92, Mol93]. Usually the most computationally intensive part of a program can be attributed to the nested loops it contains, since they iterate many times over the same statements. The goal of efficient parallelization of nested loop programs is to minimize the makespan, i.e., the parallel execution time. The major issues in minimizing the makespan of a parallel program concern the parallel resources and the communication overhead. With regard to the former, parallelizing algorithms must be designed for a bounded number of resources, such that maximum resource utilization is achieved in most cases. As for the latter, the cost of communication between the existing resources must be minimized.

Many methods have focused on accomplishing optimal time scheduling, the first of which is the hyperplane method [Lam74]. The computations are organized into well defined distinct groups, called wavefronts or hyperplanes, using a linear transformation. The hyperplane method is applicable to all loops with uniform dependencies, and all the points belonging to the same hyperplane can be concurrently executed. Darte [DKR91] proved that this method is nearly optimal. Moldovan, Shang, Darte and others applied the hyperplane method to find a linear optimal execution schedule using diophantine equations [MF86], linear programming in subspaces [SF91] and integer programming [DKR91]. The problem of finding a hypersurface that results in the minimum makespan was solved more efficiently in [PAD01]. All these approaches consider unit execution time for each iteration and zero communication for each communication step (UET model). Each loop iteration requires exactly the same processing time and there is no communication penalty for the transmission of data among the processors. Previous attempts at scheduling uniform dependence loops include the free scheduling method introduced in [KPT96].
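To make the hyperplane (wavefront) idea concrete, the sketch below executes a 2D loop with the uniform dependence vectors (1,0) and (0,1) hyperplane by hyperplane, i.e., along i1 + i2 = k; the points of a given hyperplane are mutually independent and are parallelized here with an OpenMP directive. The array size, the dependence vectors and the use of OpenMP are illustrative assumptions, not the setting of any cited paper.

    /* Wavefront execution of a 2D loop with dependence vectors (1,0) and (0,1). */
    #define N 1024
    double A[N][N];          /* row 0 and column 0 hold boundary (pre-computed) values */

    void wavefront(void) {
        for (int k = 2; k <= 2 * (N - 1); k++) {       /* hyperplane i1 + i2 = k */
            #pragma omp parallel for
            for (int i = 1; i < N; i++) {
                int j = k - i;
                if (j >= 1 && j < N)
                    A[i][j] = A[i - 1][j] + A[i][j - 1];  /* uses only hyperplane k-1 */
            }
        }
    }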
Our goal is to schedule uniform dependence loops so that the optimal parallel processing time is achieved using the minimum number of processors. The problem of scheduling uniform dependence loops is a very special case of scheduling Directed Acyclic Graphs (DAGs). The general DAG multiprocessor scheduling problem with precedence constraints is known to be NP-complete [GJ79, Ull75], even when the number of processors is unbounded [PY90]. Many researchers have tackled special cases of DAG scheduling [EFKR01, KPT96], hoping to come up with efficient polynomial algorithms.

The computation scheduling problem in our approach is a special case of the general scheduling problem, in the sense that the index space is a subset of the Cartesian space N^n, where n is the number of nested loops. The index points are not just vertices of a graph; they have coordinates in this Cartesian space. Furthermore, the precedence constraints are represented by a fixed and uniform set of dependence vectors, which is the same throughout the index space. In [AKC+03a] we propose an algorithm for the efficient assignment of computations onto the minimum number of processing elements that guarantees the optimal makespan. The proposed algorithm is polynomial in the size of the index space and performs a binary search between a lower and an upper bound of the optimal number of processors. In [AKC+03b] a new dynamic lower bound on the number of processors needed for scheduling is proposed, and the algorithm given in [AKC+03a], which decides whether there exists a legal schedule with a specific number of processors, is used to verify that all nested loops with uniform dependencies can be legally scheduled with this lower bound of processors.

In the effort to minimize the communication cost for the fine-grained approach, several methods for grouping neighboring chains of iterations have been proposed in [KN91, SC95], while maintaining the optimal linear scheduling vector [ST91, DGK+00, TKP00]. The goal of partitioning the iteration space into chains of iterations is to annul the communication for iterations of the same chain and to minimize the inter-chain communication. Therefore, some (neighboring or non-neighboring) chains may be grouped and computed by the same processor, thus reducing the overhead of inter-processor communication. An adaptive cyclic scheduling algorithm is proposed in [CAP05] for mapping groups of chains of iterations, formed along a certain dependence vector, to the same processor. The algorithm is suitable for the two most common network architectures: homogeneous and heterogeneous systems. Another algorithm, called chain pattern scheduling, is proposed in [CAD+05]; it outperforms the cyclic mapping [MA01] of chains to processors. The main advantage of the chain pattern scheduling algorithm is that, regardless of the underlying interconnection network of the parallel system, or the system's homogeneity or heterogeneity, it significantly reduces the communication overhead. Finally, another fine-grain dynamic algorithm, called successive dynamic scheduling, is proposed in [ACT+04]. This algorithm schedules the iterations on the fly, along the optimal hyperplane. A semi-automatic tool (called Cronus/1) was created to produce the equivalent parallel code for message passing architectures.

For the coarse-grained approach, many researchers use the tiling transformation to reduce the communication cost. Loop tiling or loop blocking is a loop optimization used by compilers to make the execution of certain types of loops more efficient. It was proposed by Irigoin and Triolet in [IT88]. Loop tiling for parallelism [Xue00] is a static transformation, and the time complexity of tiling is exponential in the size of the iteration space. The major issues addressed in tiling are: minimizing the communication between tiles [WL91], finding the optimal tile shape [BDRR, RS92, Xue97] and scheduling tiled iteration spaces onto parallel systems [DRR96, RRP03].
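As an illustration of the tiling (blocking) transformation itself, the sketch below blocks a simple, dependence-free 2D loop nest into B x B tiles; the tile size and the loop body are arbitrary choices, and selecting legal, communication-minimal tiles for loops with dependencies is the harder problem addressed by the work cited above.

    /* Basic loop tiling (blocking) of a 2D loop nest; B is an assumed tile size. */
    #define N 1024
    #define B 64

    void tiled(double c[N][N], double a[N][N], double b[N][N]) {
        for (int ii = 0; ii < N; ii += B)                    /* iterate over tiles  */
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B && i < N; i++)   /* iterate inside tile */
                    for (int j = jj; j < jj + B && j < N; j++)
                        c[i][j] = a[i][j] + b[i][j];
    }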
Another approach to coarse-grain parallelism is proposed in [CAK+03], where an algorithm for the online scheduling of coarse-grain grids (called SGRIDs) on shared- and distributed-memory architectures is presented. A semi-automatic tool produces the near optimal equivalent parallel program, assuming that a fixed number of processors is available.

An important class of dynamic scheduling algorithms are the self-scheduling schemes: Chunk Self-Scheduling (CSS) [KW85], Guided Self-Scheduling (GSS) [PK], Trapezoid Self-Scheduling (TSS) [TN93], and Factoring Self-Scheduling (FSS) [HSF92].

These algorithms are coarse-grained and were devised for nested loops without dependencies executed on homogeneous systems. Self-scheduling algorithms divide the total number of tasks into chunks, which are then assigned to processors (slaves). In their original form, these algorithms cannot handle loops with dependencies and do not perform satisfactorily on non-dedicated heterogeneous systems. A first attempt to make self-scheduling algorithms suitable for heterogeneous systems was Weighted Factoring (WF), proposed in [?]. WF differs from FSS in that the chunk sizes are weighted according to the processing powers of the slaves. However, with WF the processor weights remain constant throughout the parallel execution. Banicescu and Liu proposed in [BL00] a method called Adaptive Factoring (AF), which adjusts the processor weights according to timing information reflecting variations of the slaves' computation power. This was designed for time-stepping scientific applications. Chronopoulos et al. extended in [CABG] the TSS algorithm, proposing the Distributed TSS (DTSS) algorithm, suitable for distributed systems. With DTSS the chunk sizes are weighted by the slaves' relative power and the number of processes in their run-queue. Finally, in [CAR+05] three well known dynamic schemes (CSS, TSS and DTSS) are extended by introducing synchronization points at certain intervals, so that processors compute in a pipelined fashion. A scheme that uses the extended algorithms was devised, called dynamic multi-phase scheduling (DMPS), and we apply it to loops with iteration dependencies, to be scheduled on heterogeneous (dedicated & non-dedicated) clusters.
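For illustration only, the next sketch shows how these schemes are commonly described as computing their chunk sizes: a constant chunk for CSS, the ceiling of (remaining iterations)/(number of slaves) for GSS, and linearly decreasing chunks for TSS. The formulas are the commonly cited ones and are not reproduced verbatim from the cited papers.

    #include <math.h>

    /* CSS: a constant, user-chosen chunk size. */
    int css_chunk(int chunk_size) { return chunk_size; }

    /* GSS: each request gets ceil(remaining / number of slaves). */
    int gss_chunk(int remaining, int num_slaves) {
        return (int)ceil((double)remaining / num_slaves);
    }

    /* TSS: chunk sizes decrease linearly from a first size down to `last`
     * by a fixed decrement; *next_size keeps the scheme's state. */
    int tss_chunk(int *next_size, int decrement, int last) {
        int c = *next_size;
        *next_size = (c - decrement > last) ? (c - decrement) : last;
        return c;
    }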

1.3 Publications

The publications this report originates from are given below; they include the study and the conclusions that resulted.

International journals

George Papakonstantinou, Ioannis Riakiotakis, Theodore Andronikos, Florina M. Ciorba and Anthony T. Chronopoulos, Dynamic Scheduling for Dependence Loops on Heterogeneous Clusters, Journal of Neural, Scientific and Parallel Computing, 2006 (to appear).

International conferences

Ioannis Riakiotakis, Florina M. Ciorba, Theodore Andronikos, and George Papakonstantinou, Self-Adapting Scheduling for Tasks with Dependencies in Stochastic Environments, to be presented at the 5th International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks (HeteroPar 06), September 25-28, Barcelona, Spain.

Florina M. Ciorba, Theodore Andronikos, Ioannis Riakiotakis, Anthony T. Chronopoulos, and George Papakonstantinou, Dynamic Multi Phase Scheduling for Heterogeneous Clusters, Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 06), April 25-29, Rhodes Island, Greece.

Florina M. Ciorba, Theodore Andronikos, Ioannis Drositis, George Papakonstantinou, Reducing Communication via Chain Pattern Scheduling, Proceedings of the 4th IEEE Conference on Network Computing and Applications (NCA 05), Cambridge, MA, July 2005.

Florina M. Ciorba, Theodore Andronikos, George Papakonstantinou, Adaptive Cyclic Scheduling of Nested Loops, Proceedings of the 7th Hellenic European Research on Computer Mathematics and its Applications (HERCMA 05), Athens, Greece, September.

Theodore Andronikos, Florina M. Ciorba, Panagiotis Theodoropoulos, Dimitris Kamenopoulos and George Papakonstantinou, Code Generation for General Loops Using Methods from Computational Geometry, Proceedings of the IASTED Parallel and Distributed Computing and Systems Conference (PCDS 2004), Cambridge, MA, USA, November 9-11.

Florina M. Ciorba, Theodore Andronikos, Dimitris Kamenopoulos, Panagiotis Theodoropoulos and George Papakonstantinou, Simple Code Generation for Special UDLs, Proceedings of the 1st Balkan Conference in Informatics (BCI 03), Thessaloniki, Greece, November 21-23.

Theodore Andronikos, Marios Kalathas, Florina M. Ciorba, Panagiotis Theodoropoulos, George Papakonstantinou, An Efficient Scheduling of Uniform Dependence Loops, Proceedings of the 6th Hellenic European Research on Computer Mathematics and its Applications (HERCMA 03), Athens, Greece, September 25-27.

Theodore Andronikos, Marios Kalathas, Florina M. Ciorba, Panagiotis Theodoropoulos, George Papakonstantinou and Panagiotis Tsanakas, Scheduling Nested Loops with the Least Number of Processors, Proceedings of the 21st IASTED International Conference on Applied Informatics (AI 2003), Innsbruck, Austria, February.

1.4 Acknowledgement

This intermediate PhD proposal was supported by the Greek State Scholarships Foundation (Ίδρυμα Κρατικών Υποτροφιών).


CHAPTER 2

Background

2.1 Algorithmic model

The focus of this work is perfectly nested for-loops. The iterations within a loop nest are either independent (called parallel loops) or precedence constrained (called loops with dependencies, or dependence loops). Furthermore, the precedence constraints can be uniform (constant) or non-uniform throughout the execution of the program. This work examines the uniform precedence constraints case. Perfectly nested loops with uniform precedence constraints have the form shown in Figure 2.1.

    for (i_1 = l_1; i_1 <= u_1; i_1++) {
      ...
        for (i_n = l_n; i_n <= u_n; i_n++) {
          /* Loop Body */
          S_1(I);
          ...
          S_k(I);
        }
      ...
    }

    Figure 2.1: Algorithmic model

The depth of the loop nest is n and determines the dimension of the iteration index space J = { j ∈ N^n | l_r ≤ i_r ≤ u_r, 1 ≤ r ≤ n }. The lower and upper bounds of the loop indices are l_i and u_i ∈ Z, respectively. Each point of this n-dimensional index space is a distinct iteration of the loop body. When the loop body LB(j) contains simple assignment statements, the loop nest is considered to be simple and the execution time of one iteration is identical for all iterations. The loop body LB(j), however, can contain general program statements S_i(I), which may include assignment statements, conditional if statements and repetitions such as for or while. In this case, the execution time of one iteration is not necessarily the same for all iterations. L = (l_1, ..., l_n) and U = (u_1, ..., u_n) are the initial and terminal points of the index space.

The cardinality of J, denoted |J|, is ∏_{i=1}^{n} (u_i - l_i + 1). Both types of loop nests are examined in this work.

2.2 Iteration space & precedence constraints

A popular geometrical representation of a nested loop is the Cartesian space model, in which points of the Cartesian space have coordinates and represent iterations of the loop (or tasks), and directed vectors represent the inter-iteration dependencies, such as precedence. This is illustrated in Fig. 2.2, where the initial and terminal points are highlighted and 5 uniform precedence constraints are depicted. In this work we will refer to such directed vectors as uniform dependence vectors. The set of the p dependence vectors is DS = { d_1, ..., d_p }, p ≥ n. By definition, dependence vectors are always lexicographically greater than 0, where 0 = (0, ..., 0). Any loop instance LB(j) depends on previous loop instance(s) as follows:

    LB(j) = f(LB(j - d_1), ..., LB(j - d_p))    (2.1)

    Figure 2.2: The geometrical representation of a 2D Cartesian iteration index space with 5 uniform precedence constraints

In a fine-grained approach, one loop iteration is assigned to one processor and, due to the existing dependence vectors and the mapping choice, communication may be needed for every iteration. Similarly, in a coarse-grained approach, groups of iterations are assigned to the same processor, to increase the data locality and eliminate the communication incurred by some of the dependence vectors. Both approaches are examined in this work, while striving to minimize the makespan and the communication overhead.
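As a small concrete illustration (not code from this work), the following sketch computes the cardinality |J| and checks the lexicographic positivity required of dependence vectors, for a 2D index space; the fixed dimension and the numeric values are borrowed from the example used later in Section 3.1.1.

    #include <stdio.h>

    #define NDIM 2

    /* |J| = prod over r of (u_r - l_r + 1). */
    long cardinality(const int l[NDIM], const int u[NDIM]) {
        long card = 1;
        for (int r = 0; r < NDIM; r++)
            card *= (long)(u[r] - l[r] + 1);
        return card;
    }

    /* Returns 1 iff d > (0,...,0) in the lexicographic ordering. */
    int lex_positive(const int d[NDIM]) {
        for (int r = 0; r < NDIM; r++) {
            if (d[r] > 0) return 1;    /* first non-zero coordinate is positive */
            if (d[r] < 0) return 0;
        }
        return 0;                      /* d is the zero vector */
    }

    int main(void) {
        int L[NDIM] = {1, 1}, U[NDIM] = {10, 10}, d1[NDIM] = {3, 1};
        printf("|J| = %ld, d1 lex-positive: %d\n", cardinality(L, U), lex_positive(d1));
        return 0;
    }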

CHAPTER 3

Static scheduling algorithms

The objective of scheduling is to minimize the completion time of a parallel application by properly allocating the tasks to the processors. As mentioned in the Introduction, scheduling can be static or dynamic. This chapter gives a few static scheduling algorithms, aimed at efficiently scheduling uniform dependence loops. Static scheduling is usually done at compile time, and the characteristics of the parallel program (such as task processing times, communication, data dependencies, and synchronization requirements) are (assumedly) known before program execution. Common simplifying assumptions include uniform task execution times, zero inter-task communication times, contention-free communication, full connectivity of parallel processors, and the availability of an unlimited number of processors. These assumptions may not hold in practical situations for a number of reasons. The first two algorithms described in this chapter take a fine-grained approach to the parallelization problem and give optimal time schedules that use the lower bound on the number of processors needed for scheduling. The last two algorithms approach the problem again in a fine-grained fashion, while striving to significantly reduce the communication overhead by enhancing data locality.

The following notations and definitions are used throughout this chapter:

UDL: uniform dependence loop, a loop nest with uniform precedence constraints.

GRIDs: a special category of UDLs with unitary dependence vectors that reside on the axes.

PE: processing element.

i: the current moment.

OET: the optimal execution time, or the least parallel time (assuming that a sufficient number of processors is available).

LB: the lower bound on the number of processors.

UB: the upper bound on the number of processors.

NP: a candidate number of processors.

ONP: the optimal number of processors, i.e., the least number of processors required to achieve the OET; ONP ∈ [LB, UB].

ECT: earliest computation time.

LCT: latest computation time.

CR: crucial points.

The index space J of a UDL can be organized into two disjoint sequences of time sets, denoted ECT_i and LCT_i, 1 ≤ i ≤ OET. ECT_i contains the points of J whose earliest computation time is i, and LCT_i contains the points of J whose latest computation time is i. ECT(j) and LCT(j) are the earliest and the latest computation time, respectively, of point j ∈ J. ECT_i ∩ LCT_i = CR_i is the set of crucial points that must be executed at moment i. By assuming that the computation always begins at moment 1, and without loss of generality, OET = ECT(U).

The out-degree of an index point j is the number of dependence vectors originating from j that end up at points within the index space.

All the points of ECT_i are grouped according to their delay, resulting in the sequence D_i^k of points having delay k. We call crucial all the points with delay zero (denoted D_i^0). These points must be executed at moment i.

U_i: the set of points that are unavailable for execution at the beginning of moment i.

A_i: the set of points that are available for execution at the beginning of moment i.

NA_i: the set of points that are (become) available for execution at the end of moment i.

R_i: the set of points that were not executed at moment i due to lack of free processors, i.e., the points remaining from moment i.

E_h: the total number of points that must be scheduled at this moment (containing points of the current hyperplane h and any other points remaining unexecuted from previous hyperplanes).

Hyperplane (wavefront): the set of points that can be executed at a specific moment, due to the existence of dependence vectors. All points of the same hyperplane can be executed concurrently.

J can be partitioned into disjoint time subsets called regions and denoted R_k, k ≥ 0, such that R_k (i.e., region k) contains the points whose earliest computation time is k, and designates the area between hyperplanes k and k+1 as well as the points of hyperplane k. R_0 denotes the boundary (pre-computed) points. (The pre-computed points designate iteration points that do not belong to the first quadrant of the index space, but represent initial values for specific problems.)

J can be further partitioned into cones, and from the hyperplane of every cone an optimal hyperplane is determined using the QuickHull algorithm.

A cone is the convex subspace formed by q vectors d_1, ..., d_q ∈ N^n, q ≤ p, and is denoted Con(d_1, ..., d_q) = { j ∈ N^n | j = λ_1 d_1 + ... + λ_q d_q, where λ_1, ..., λ_q ≥ 0 }.

Non-trivial cones, or just cones: the cones defined exclusively by dependence vectors.

Trivial cones: cones defined by dependence vectors and at least one unitary axis vector.

d_c: the communication vector, usually chosen to be the dependence vector that incurs the largest amount of communication (in most cases the vector with the smallest absolute coordinates).

J can also be partitioned into disjoint time subsets denoted Pat_s, s ≥ 0, such that Pat_s contains those points of J whose earliest computation time is s.

Pat_s: the pattern corresponding to moment s. Pat_0 denotes the boundary (pre-computed) points.

Pat_1: the initial pattern, i.e., the subset of index points that can be computed initially (at moment 1).

The pattern-outline is the upper boundary of each Pat_s and is denoted pat_s, s ≥ 1. The pattern points are those index points that are necessary in order to define the polygon shape of the pattern. The pattern vectors are those dependence vectors d_i whose end-points are the pattern points of the initial pattern.

3.1 Scheduling using a binary search decision algorithm

This section describes an algorithm for the efficient assignment of computations onto the minimum number of processing elements that guarantees the optimal makespan. The algorithm is called the binary search decision algorithm (BSDA) and was published in [AKC+03a]. BSDA is polynomial in the size of the index space and performs a binary search between a lower and an upper bound of the optimal number of processors. The algorithm aims at building an optimal schedule that achieves the optimal parallel processing time using the minimum number of processors, and it follows a fine-grained approach in a centralized manner.

The addressed problem can be stated as follows: given a uniform dependence loop with a loop body consisting of simple assignment statements, we want to find a legal schedule (a schedule is legal iff it includes all index points and does not violate any precedence constraint) that results in the minimum makespan using the least number of processors. Based on previous theory [APT96, KPT96], the optimal scheduling in terms of the number of processors, while achieving the optimal makespan, is established. This approach is based on the unit execution, zero communication (UET) model of UDLs. However, as has been shown in [PAD01], a unit execution, unit communication time (UET-UCT) UDL can be reduced to an equivalent UET UDL; hence, in that sense, this approach also applies to UDLs that assume the UET-UCT model. By ignoring the communication delays, the criteria of optimality are restricted to the total number of processing elements. A preliminary version of BSDA was proposed in [APT96] and [PKA+97].

3.1.1 Computing the ECT and LCT subsets

Given a UDL, ECT_1 contains those points j that initially do not depend on other points. Every subsequent set ECT_{i+1}, 1 ≤ i ≤ OET - 1, contains those points j that depend on points belonging to one of the previous ECT sets, at least one of which must be ECT_i. After the ECT subsets are computed with any of the known methods, e.g. [DAK+01, PAD01], the LCT of any index point j can be computed in one pass using the following formula from [AKC+03b]:

    LCT(j) = OET - ECT(U - j + L) + 1    (3.1)

The CR subsets are formed by taking the intersection of the ECT and LCT sets.

Example. Consider a 2D index space representing the following two-level loop nest in unit-increment steps:

    for (i=1; i<=10; i++) {
      for (j=1; j<=10; j++) {
        A[i,j] = A[i-3,j-1] + 3*A[i-4,j-2] + 1;
        B[i,j] = 2*B[i-2,j-2] - 1;
      }
    }

The index space J is {(i,j) ∈ N^2 | 1 ≤ i ≤ 10, 1 ≤ j ≤ 10}, the initial point L is (1,1), the terminal point U is (10,10) and the dependence vectors are d_1 = (3,1), d_2 = (4,2), d_3 = (2,2). The dependence vectors express the fact that iteration j depends on the previous iterations j - d_1, j - d_2, j - d_3; for instance, to compute A[i,j] we must have already computed A[i-3,j-1] and A[i-4,j-2]. The sequences of ECT and LCT sets and their intersection (the CR sets) are depicted in Figure 3.1. The minimum makespan OET is determined to be 5. All the points of ECT_1 can be executed at moment 1. However, only those points that belong to CR_1 must be executed at moment 1.

3.1.2 The binary scheduling decision algorithm

For any given n-dimensional UDL, we construct the sequences ECT_i and LCT_i, 1 ≤ i ≤ OET. Consider that for a point j ∈ ECT_i, the earliest moment it can be executed is i. If j ∈ LCT_i too, then the latest moment it can be executed is also i. Therefore j is a crucial point and j ∈ CR_i. No delay is possible for this point and the algorithm will execute j at moment i. However, if j ∈ LCT_{i+k}, k > 0, then we could delay its execution until moment i+k, in which case point j would have delay k. The binary scheduling algorithm always schedules the crucial points at their appropriate moments, i.e., it schedules the points of CR_i at moment i. If there are not enough processors to schedule all the crucial points at some moment, then the algorithm will return NO, indicating that it is impossible to produce a legal schedule in OET using this number of processors. In case there are enough processors to schedule all the crucial points, and some processors are still free, we continue to schedule points with delay 1, i.e., k = 1. If the points with delay 1 are more than the free processors, then, in order to establish some kind of priority, they are sorted in increasing order of their earliest computation time, and those with the same ECT value are sorted in decreasing order of their out-degree. If some processors still remain free after all points with delay 1 have been assigned, we proceed to schedule the points with delay 2 on these processors, and so on.
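For illustration only (this is not the BSDA-Tool code), the ECT values of the example above can be computed by a single sweep of the index space in lexicographic order, and the LCT values then follow from eq. (3.1); the 10x10 bounds and the three dependence vectors are those of the example.

    #include <stdio.h>

    #define N 10
    static const int DS[3][2] = { {3, 1}, {4, 2}, {2, 2} };

    int ECT[N + 1][N + 1], LCT[N + 1][N + 1], OET;

    void compute_ect_lct(void) {
        OET = 0;
        for (int i = 1; i <= N; i++)              /* lexicographic sweep of J */
            for (int j = 1; j <= N; j++) {
                int t = 1;                        /* no predecessor => ECT = 1 */
                for (int k = 0; k < 3; k++) {
                    int pi = i - DS[k][0], pj = j - DS[k][1];
                    if (pi >= 1 && pj >= 1 && ECT[pi][pj] + 1 > t)
                        t = ECT[pi][pj] + 1;      /* wait for the latest predecessor */
                }
                ECT[i][j] = t;
                if (t > OET) OET = t;             /* OET, which equals ECT(U) here */
            }
        for (int i = 1; i <= N; i++)              /* eq. (3.1): LCT(j) = OET - ECT(U - j + L) + 1 */
            for (int j = 1; j <= N; j++)
                LCT[i][j] = OET - ECT[N - i + 1][N - j + 1] + 1;
    }

    int main(void) {
        compute_ect_lct();
        printf("OET = %d\n", OET);                /* prints 5 for this example */
        return 0;
    }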

Figure 3.1: The geometrical representation of the ECT, LCT and CR sets for the example of Section 3.1.1

I. The binary scheduling algorithm description

The binary search decision algorithm (BSDA) consists of two parts: the binary scheduling algorithm (BSA) and the decision algorithm (DA). The BSA takes as input an algorithm A(J, DS) and computes the ECT, LCT and CR sets. It determines the LB and UB and performs a binary search between LB and UB. For every candidate value P that results from this search, the BSA calls the DA to decide whether there exists a legal schedule with P processors. Hence, the smallest natural ONP for which the DA returns YES, and the corresponding schedule, are produced.

Searching between the lower and upper bounds

The lower bound on the number of processors is LB = max{ ⌈|J| / OET⌉, max{|CR_i|} }, 1 ≤ i ≤ OET. With fewer than LB processors there is no legal way of scheduling the |J| points in OET moments. The upper bound on the number of processors is UB = max{ |ECT_1|, ..., |ECT_OET| }, where |ECT_i| is the cardinality of ECT_i. With UB processors we can legally schedule a UDL achieving OET.

A binary search is performed between the lower and upper bounds in order to narrow down, as fast as possible, the number of processors that suffice for the scheduling. The search goes as follows: ⌊(LB + UB)/2⌋ processors are used in the beginning; if a schedule is found, the process goes on with ⌊(LB + ⌊(LB + UB)/2⌋)/2⌋ processors, getting closer to LB. If no schedule can be found, the process goes on with ⌊(⌊(LB + UB)/2⌋ + UB)/2⌋ processors, getting closer to UB, and so on, until the ONP that suffices for the schedule is reached.

Example (continued from 3.1.1). In this example, |J| = 100, therefore |J| / OET = 20. Since max{|CR_i|} = max{5, 6, 6, 6, 5} = 6, we have LB = max{20, 6} = 20. Similarly, UB = max{29, 26, 23, 17, 5} = 29. The binary search is performed between these bounds as follows: it starts by trying to find a legal schedule using ⌊(20 + 29)/2⌋ = 24 processors. If a schedule is found, the process goes on trying with 21 processors, getting closer to LB; if no legal schedule can be found, it tries with 27 processors, getting closer to UB, and so on.

II. The decision algorithm description

The algorithm is basically a decision one: it takes as input a candidate number P of processors and decides whether there is a schedule with P processors in OET. If the answer is YES, it also outputs the schedule.

The Decision Algorithm
Input: ECT_i, LCT_i, 1 ≤ i ≤ OET, and P
Output: YES (and a schedule) iff P PEs can execute A(J, DS) in OET
Initialization: A_1 = ECT_1
FOR i = 1 TO OET DO
BEGIN
  N = P
  IF i = 1 THEN GOTO Step 2
  Step 1:
    IF i = 2 THEN U_2 = { j ∈ ECT_2 | ∃ i' ∈ R_1, ∃ d ∈ DS : j = i' + d }
    ELSE U_i = { j ∈ ECT_i | ∃ i' ∈ (R_{i-1} ∪ U_{i-1}), ∃ d ∈ DS : j = i' + d } ∪ U_{i-1}
    A_i = (ECT_i - U_i) ∪ R_{i-1}
  Step 2:
    IF N < |A_i ∩ LCT_i| THEN GOTO Step 6
    Assign a PE to each j ∈ A_i ∩ LCT_i
    N = N - |A_i ∩ LCT_i|
    IF N > 0 AND A_i ∩ LCT_{i+k} ≠ ∅ for some k, THEN GOTO Step 3 ELSE GOTO Step 4
  Step 3:
    WHILE N > 0 DO
      k_0 is the smallest natural k such that A_i ∩ LCT_{i+k} ≠ ∅
      IF |A_i ∩ LCT_{i+k_0}| > N THEN
        Sort A_i ∩ LCT_{i+k_0} in order of increasing ECT & decreasing out-degree
      Assign as many PEs as possible to j ∈ A_i ∩ LCT_{i+k_0}
      N = N - |A_i ∩ LCT_{i+k_0}|
    ENDWHILE
  Step 4:
    IF i ≠ 1 THEN
      NA_i = { j ∈ U_i | ∀ d ∈ DS, j - d was executed or is outside the index space }
      U_i = U_i - NA_i
      R_i = (A_i - {points executed at moment i}) ∪ NA_i
    ELSE R_1 = A_1 - {points executed at moment 1}
ENDFOR
Step 5: Return YES and STOP
Step 6: Return NO and STOP
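The outer binary search performed by the BSA around this decision algorithm can be sketched as follows; decide(P) stands for an implementation of the DA above and returns non-zero iff P processors admit a legal schedule within OET. The routine name and structure are illustrative, not taken from the thesis.

    int decide(int P);   /* the decision algorithm (DA), assumed implemented elsewhere */

    int binary_search_onp(int LB, int UB) {
        int ONP = UB;                   /* UB always admits a legal schedule */
        while (LB <= UB) {
            int P = (LB + UB) / 2;
            if (decide(P)) {            /* schedule found: move toward LB */
                ONP = P;
                UB = P - 1;
            } else {                    /* no schedule: move toward UB */
                LB = P + 1;
            }
        }
        return ONP;                     /* the optimal number of processors */
    }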

Next we choose the |A_2 ∩ LCT_{2+1}| = 13 points and we assign them to the next 13 available processors and N becomes 1. Then the |A_2 ∩ LCT_{2+2}| = 7 points are selected and, because N < |A_2 ∩ LCT_{2+2}|, these points are sorted as described at moment 1; we schedule the first point to the last free processor and N becomes −6. We calculate the subset NA_2, which contains 3 points, designating those points initially belonging to U_2 that have now become available for execution, because they no longer depend on any unexecuted points. We recalculate the subset of unavailable points at this moment, U_2, by subtracting the points of NA_2, resulting in one point. The 14 remaining points from this moment, included in R_2, are the points of A_2 that were not yet executed, along with the points of NA_2. We run the algorithm for each moment until we reach OET = 5. The resulting schedule is shown in Table 3.1.

Conclusion. The DA produced a legal schedule using LB processors; since there are no idle processors at any moment, maximum processor utilization is achieved.

Table 3.1: Optimal schedule with 20 processors for the UDL of the example.

Processor #   Time instance 1   2        3        4        5
0             (2,2)    (3,3)    (5,5)    (7,7)    (9,9)
1             (2,1)    (3,4)    (5,6)    (7,8)    (9,10)
2             (1,3)    (3,5)    (5,7)    (7,9)    (10,8)
3             (1,2)    (4,2)    (6,4)    (8,6)    (10,9)
4             (1,1)    (4,3)    (6,5)    (8,7)    (10,10)
5             (4,1)    (4,4)    (6,6)    (8,8)    (8,10)
6             (3,2)    (3,6)    (1,9)    (8,5)    (10,5)
7             (3,1)    (3,7)    (2,9)    (8,4)    (10,6)
8             (2,6)    (3,8)    (8,1)    (8,3)    (10,7)
9             (2,5)    (4,5)    (7,2)    (6,9)    (9,3)
10            (2,4)    (4,6)    (4,9)    (5,9)    (10,3)
11            (2,3)    (4,7)    (8,2)    (10,1)   (9,4)
12            (1,7)    (4,8)    (3,9)    (9,1)    (6,10)
13            (1,6)    (5,2)    (5,8)    (2,10)   (10,4)
14            (1,5)    (5,3)    (6,7)    (1,10)   (9,8)
15            (1,4)    (5,4)    (6,8)    (4,10)   (9,7)
16            (6,1)    (6,2)    (7,3)    (9,2)    (9,6)
17            (5,1)    (6,3)    (7,4)    (10,2)   (9,5)
18            (2,8)    (1,8)    (7,5)    (3,10)   (8,9)
19            (2,7)    (7,1)    (7,6)    (5,10)   (7,10)

3.1.3 Complexity of the BSDA algorithm

BSDA is linear in the size of J. The most computationally expensive step is the calculation of the two sequences of ECT_i and LCT_i sets. Due to the uniformity of the index space this can be done in linear time in the size of J. Computing the ECT subsets requires that we traverse the index space only once, whereas the LCT subsets are computed using eq. (??). The CR subsets are computed by intersecting each ECT subset with the corresponding LCT subset.
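The forward and backward traversals just described can be sketched in a few lines. The following is a minimal illustration for a 2D UDL, assuming the index space is the rectangle [1,N1] x [1,N2] and that all dependence coordinates are non-negative; the dependence vectors, array names (ect, lct) and fixed bounds are placeholders chosen for the sketch, not values from the thesis tool.

    #include <stdio.h>

    #define N1 10
    #define N2 10
    #define P  3                      /* number of dependence vectors (illustrative) */

    static const int d[P][2] = { {1,1}, {0,2}, {2,0} };   /* placeholder vectors */

    static int ect[N1 + 1][N2 + 1];   /* earliest computation time of each point */
    static int lct[N1 + 1][N2 + 1];   /* latest computation time of each point   */

    int main(void)
    {
        int i, j, k, oet = 0;

        /* Forward sweep: ECT(j) = 1 + max over d of ECT(j - d), or 1 when no
           predecessor lies inside the index space.  Lexicographic order is
           sufficient because all dependence coordinates are non-negative.    */
        for (i = 1; i <= N1; i++)
            for (j = 1; j <= N2; j++) {
                ect[i][j] = 1;
                for (k = 0; k < P; k++) {
                    int pi = i - d[k][0], pj = j - d[k][1];
                    if (pi >= 1 && pj >= 1 && ect[pi][pj] + 1 > ect[i][j])
                        ect[i][j] = ect[pi][pj] + 1;
                }
                if (ect[i][j] > oet) oet = ect[i][j];     /* OET = max ECT */
            }

        /* Backward sweep: LCT(j) = min over d of LCT(j + d) - 1, or OET when
           the point has no successor inside the index space.                */
        for (i = N1; i >= 1; i--)
            for (j = N2; j >= 1; j--) {
                lct[i][j] = oet;
                for (k = 0; k < P; k++) {
                    int si = i + d[k][0], sj = j + d[k][1];
                    if (si <= N1 && sj <= N2 && lct[si][sj] - 1 < lct[i][j])
                        lct[i][j] = lct[si][sj] - 1;
                }
            }

        printf("OET = %d\n", oet);
        return 0;
    }

A point then belongs to CR_i exactly when ect == lct == i, which is how the intersection of each ECT subset with the corresponding LCT subset, mentioned above, can be obtained in a single additional pass.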

In Steps 1-6 of the DA every point of the index space is examined at most twice. This examination depends only on the number of dependence vectors p and the dimension of the index space n, which are independent of the size of the index space |J|. Therefore, we conclude that each run of the DA can be done in Θ(|J|) time. A trivial upper bound on the number of DA runs is log UB, which is bounded above by log |J|. Thus, the overall time complexity of BSDA is O(|J| log |J|).

3.1.4 Experimental results

A tool, called BSDA-Tool, that generates random UDLs and then determines the optimal schedules was developed. We ran experiments with random UDLs in 2, 3, 4 and 5 dimensions, with two ranges for the number of dependence vectors, 3-6 being the smaller one. The results are shown in Table 3.2 and Fig. 3.2, showing the percentage of UDLs that have ONP = LB, the percentage that have ONP within 10% of the LB, etc.

While generating random UDLs, a special subclass of UDLs was identified: the scaled grids (SGRIDs). It includes those UDLs whose dependence vectors reside on the axes and whose non-zero coordinate is an arbitrary positive integer, i.e., these dependence vectors are non-unitary. We observed that SGRIDs always require UB processors for an optimal schedule, if the size of the index space in each dimension is divisible by the corresponding coordinate of the dependence vector of this dimension. This means that the size of the index space in a certain dimension is a multiple of the size of the dependence vector residing on that axis. In order to produce the optimal schedule, the DA algorithm has to be run. The statistical analysis of the experimental results led us to the following conclusions: (1) For SGRIDs, UB processors must be used. (2) For the vast majority of UDLs (55% to more than 80% in some cases), the ONP falls within 10% of the LB.

Table 3.2: Execution evaluation for 2D, 3D, 4D, 5D UDLs and the percentage of binary searching ONP between LB and UB.

dim. #   dep. #   LB      1-10%    %        %        UB
2 dim             9.2%    19.5%    6.9%     1.2%
2 dim             7.9%    26.1%    17.1%    2.9%
3 dim             6.3%    10.6%    8.8%     2.4%
3 dim             10.3%   16.1%    7.3%     1.9%
4 dim             6.6%    8.7%     7.4%     2.0%
4 dim             12.4%   13.9%    6.1%     1.9%
5 dim             7.6%    6.6%     6.9%     1.7%
5 dim             7.9%    15.2%    6.5%     2.1%

3.2 Dynamic lower bound on the number of processors

This section describes a different approach to the problem addressed in the previous section, i.e., finding a legal schedule that results in the minimum makespan using the least number of processors. The approach described here, published in [AKC + 03b], is a fine-grained and centralized scheme.

Figure 3.2: The results of scheduling 2D, 3D, 4D, 5D UDLs with ONP ∈ [LB, UB]

A new dynamic lower bound on the number of processors needed for scheduling is presented, and the decision algorithm described above is employed to verify that all nested loops with uniform dependencies can be scheduled with this lower bound. The loops considered in this approach have uniform dependence vectors and their loop body consists of simple assignment statements.

As before, the index space is organized into two sequences of disjoint time-subsets ECT_i and LCT_i, i ≥ 1, such that ECT_i contains the points whose earliest computation time is i, and LCT_i contains those points whose latest computation time is i. The LCT of an index point is computed using eq. (??). If the ECT and the LCT of an index point are both i, then we say that the point belongs to ECT_i and has no delay. On the other hand, if the ECT of an index point is i but its LCT is i + 1, we say that this point belongs to ECT_i but has a delay of one moment. Following this direction, we form partitions of the ECT sets according to the delay corresponding to each point, i.e., those that have no delay in D_i^0, those that have delay 1 in D_i^1, and so on. The intersection of each ECT_i with the corresponding LCT_i is D_i^0 and we call these points crucial.

3.2.1 Computing the lower and upper bound of processors

For any given UDL, we may have three major lower bounds depending upon the cardinality of the ECT_1 set. The most trivial one is LB_1 = ⌈|J|/OET⌉. If |ECT_1| > LB_1, in the first moment we utilize all free processors and for most UDLs we achieve the OET using ONP = LB_1, due to the geometry of the index space that allows obtaining maximum processor utilization at each time step. If |ECT_1| < LB_1 it is possible, however, that a legal schedule cannot be found with LB_1 processors in OET. On the other hand, at each moment we have the sequence of crucial points D_i^0. Therefore, another lower bound is LB_2 = max_i{|D_i^0|}, since with fewer than LB_2 processors we cannot produce a schedule in OET.

Unfortunately, in many cases even this does not suffice. LB_2 is a static lower bound in the sense that it regards the number of crucial points as constant for each time step regardless of the number of processors, which is not an accurate assumption. For instance, if some of the points of D_i^1 are not scheduled at moment i, they become crucial at moment i + 1, so the correct number of crucial points at moment i + 1 is greater than the initially computed |D_{i+1}^0| points, therefore it is not a constant number. This partitioning is depicted in Fig. 3.3 for the example given below. It is therefore clear that a better, improved lower bound is needed.

The new lower bound (LB_3) described in this section takes into consideration the propagation of points with delays that remained unexecuted from one time step to another. In order to determine LB_3, a sequence of linear inequalities of the following form must be solved: minimize P_h ∈ N such that P_h ≥ E_h, where 1 ≤ h ≤ OET, P_h is the candidate number of processors required for hyperplane h, and LB_3 = max{P_h}.

For the first hyperplane, the candidate number of processors to schedule the crucial points of ECT_1 must satisfy the inequality:

P_1 ≥ E_1 = |D_1^0|    (3.2)

For every subsequent hyperplane h, we first check if the number of processors we have found so far, P_{h−1}, is sufficient. We do that by computing the total number of points that must be scheduled at this moment, using the formula:

E_h = |D_h^0| + Σ_{k=1}^{h−1} ( Σ_{i=0}^{h−k} |D_k^i| − P_{h−1} )    (3.3)

where h, 1 ≤ h ≤ OET, is the current hyperplane. If E_h ≤ P_{h−1} then P_{h−1} is sufficient for the scheduling of the crucial points of the first h ECT sets at this moment also, consequently there is no need for more processors. In this case, we set P_h = P_{h−1} and continue with the next hyperplane. On the other hand, if E_h > P_{h−1} then we clearly need more processors and we compute the exact number of processors, P_h, necessary for the current hyperplane as follows. We impose the condition:

P_h ≥ E_h = |D_h^0| + Σ_{k=1}^{h−1} ( Σ_{i=0}^{h−k} |D_k^i| − P_h )    (3.4)

In order to solve the above inequality meaningfully we must examine h cases. Clearly P_h ≥ |D_h^0| and we have to consider whether P_h ≤ Σ_{i=0}^{h−k} |D_k^i| or P_h > Σ_{i=0}^{h−k} |D_k^i|, where 1 ≤ k ≤ h − 1. So we compute the h − 1 terms Σ_{i=0}^{h−k} |D_k^i|, order them in increasing order along with |D_h^0| (so we have a sequence of h terms t_1 ≤ ... ≤ t_h), and we try to solve inequality (3.4) initially assuming that t_1 ≤ P_h ≤ t_2; if we do not find a solution we assume that t_2 < P_h ≤ t_3, and so on, until we find the first solution of (3.4). The worst case scenario is when P_h > t_h. We must stress the fact that in solving (3.4) the physical meaning of E_h must be taken into consideration. For instance, if Σ_{i=0}^{h−k} |D_k^i| < P_h, then in E_h the term Σ_{i=0}^{h−k} |D_k^i| − P_h is not taken as a negative integer, which would have no meaning, but is taken instead as zero, meaning that the index points belonging to this set were all scheduled and none remained.
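A direct way to realize the computation described by (3.2)-(3.4) is sketched below. Instead of the case analysis over the ordered thresholds t_1 ≤ ... ≤ t_h, the sketch scans candidate values of P_h upward, which reaches the same minimal solution because the right-hand side of (3.4) is non-increasing in P_h; the table D[k][i] = |D_k^i| is assumed to have been computed beforehand, and the array bounds and function name are illustrative.

    /* Sketch of the dynamic lower bound LB3, assuming D[k][i] = |D_k^i| for
       hyperplanes k = 1..oet and delays i = 0..oet-k (and 0 elsewhere).      */
    #define MAXH 64

    static int lb3(int oet, const int D[MAXH + 1][MAXH + 1])
    {
        int P = D[1][0];                        /* eq. (3.2): P_1 >= |D_1^0|  */
        for (int h = 2; h <= oet; h++) {
            /* E_h for the current candidate P (eq. (3.3)); negative terms
               are clamped to zero, as required by the physical meaning.      */
            long E = D[h][0];
            for (int k = 1; k <= h - 1; k++) {
                long s = 0;
                for (int i = 0; i <= h - k; i++) s += D[k][i];
                if (s - P > 0) E += s - P;
            }
            if (E <= P) continue;               /* P_{h-1} still suffices     */

            /* Otherwise find the smallest P_h satisfying (3.4).              */
            int Ph = D[h][0];
            for (;;) {
                long rhs = D[h][0];
                for (int k = 1; k <= h - 1; k++) {
                    long s = 0;
                    for (int i = 0; i <= h - k; i++) s += D[k][i];
                    if (s - Ph > 0) rhs += s - Ph;
                }
                if (Ph >= rhs) break;
                Ph++;
            }
            P = Ph;
        }
        return P;                               /* LB3 = last computed P_h    */
    }

On the worked example that follows, this scan reproduces the same sequence P_1 = 9, P_2 = 23, P_3 = 36, P_4 = 45, P_5 = P_6 = P_7 = 48; the final lower bound is then LB = max{LB_1, LB_2, LB_3}.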

In the end, LB_3 is the last dynamically computed P_h that sufficed for a legal schedule of all the index points of the nested loops. We take as the final LB = max{LB_1, LB_2, LB_3}. We define UB = max{|ECT_1|, ..., |ECT_OET|} as the trivial upper bound of processors and we say that with UB processors we can legally schedule this UDL achieving the OET.

Example. As an example, consider a 2D index space representing the following two-level loop nest in unit-increment steps:

for (i=1; i<=18; i++) {
  for (j=1; j<=18; j++) {
    A[i,j] = A[i-1,j-4] + 3*A[i-4,j-1] + 1;
  }
}

The ECT sets divided into subsets according to the delay value are depicted in Fig. 3.3.

Figure 3.3: The ECT sets for the example (hyperplanes 1 through 7, each ECT_i partitioned into its delay subsets D_i^0, D_i^1, ...).

For the first hyperplane we have the inequality:

P_1 ≥ |D_1^0| = 9

so we consider P_1 = 9. For the second hyperplane we check if P_1 processors suffice using (3.3):

E_2 = |D_2^0| + (|D_1^0| + |D_1^1| − P_1) = 37

hence E_2 > P_1, therefore we need more processors. We impose the condition:

P_2 ≥ |D_2^0| + (|D_1^0| + |D_1^1| − P_2)
P_2 ≥ 18 + (28 − P_2)

In order to solve the above inequality we must examine the case where 18 ≤ P_2 ≤ 28, and we try to find a solution by solving the equation:

P_2 = 18 + (28 − P_2)  =>  P_2 = 23

which satisfies the condition. So we proceed to the third hyperplane and check if P_2 processors suffice:

E_3 = |D_3^0| + (|D_1^0| + |D_1^1| + |D_1^2| − P_2) + (|D_2^0| + |D_2^1| − P_2) = 60

hence E_3 > P_2, therefore we again need more processors. We impose the condition:

P_3 ≥ |D_3^0| + (|D_1^0| + |D_1^1| + |D_1^2| − P_3) + (|D_2^0| + |D_2^1| − P_3)
P_3 ≥ 27 + (36 − P_3) + (43 − P_3)

In order to solve the above inequality we must examine the case where 27 ≤ P_3 ≤ 36, and we try to find a solution by solving the equation:

P_3 = 27 + (36 − P_3) + (43 − P_3)  =>  P_3 = 36

which satisfies the condition. We go on to the fourth hyperplane and check if P_3 processors suffice:

E_4 = |D_4^0| + (|D_1^0| + |D_1^1| + |D_1^2| + |D_1^3| − P_3) + (|D_2^0| + |D_2^1| + |D_2^2| − P_3) + (|D_3^0| + |D_3^1| − P_3) = 63

hence E_4 > P_3 and again we need more processors. We impose the condition:

P_4 ≥ |D_4^0| + (|D_1^0| + |D_1^1| + |D_1^2| + |D_1^3| − P_4) + (|D_2^0| + |D_2^1| + |D_2^2| − P_4) + (|D_3^0| + |D_3^1| − P_4)
P_4 ≥ 36 + (38 − P_4) + (45 − P_4) + (52 − P_4)

In order to solve the above inequality we must examine the case where 36 ≤ P_4 ≤ 38, and we try to find a solution by solving the equation:

P_4 = 36 + (38 − P_4) + (45 − P_4) + (52 − P_4)  =>  P_4 = 43

which does not satisfy the condition. Therefore we must examine the case where 38 < P_4 ≤ 45, and we solve the same equation as above, taking the first parenthesis as zero instead of a negative integer:

P_4 = 36 + (45 − P_4) + (52 − P_4)  =>  P_4 = 45

which satisfies the condition. Next we go to the fifth hyperplane and check if P_4 processors suffice:

E_5 = |D_5^0| + (|D_1^0| + |D_1^1| + |D_1^2| + |D_1^3| + |D_1^4| − P_4) + (|D_2^0| + |D_2^1| + |D_2^2| + |D_2^3| − P_4) + (|D_3^0| + |D_3^1| + |D_3^2| − P_4) + (|D_4^0| + |D_4^1| − P_4) = 49

hence E_5 > P_4 and again we need more processors. We impose the condition:

P_5 ≥ |D_5^0| + (|D_1^0| + ... + |D_1^4| − P_5) + (|D_2^0| + ... + |D_2^3| − P_5) + (|D_3^0| + |D_3^1| + |D_3^2| − P_5) + (|D_4^0| + |D_4^1| − P_5)
P_5 ≥ 27 + (40 − P_5) + (47 − P_5) + (54 − P_5) + (61 − P_5)

First we examine the case where 27 ≤ P_5 ≤ 40 and we try to find a solution by solving the equation:

P_5 = 27 + (40 − P_5) + (47 − P_5) + (54 − P_5) + (61 − P_5)  =>  P_5 = 46

which does not satisfy the condition. Therefore we must examine the case where 40 < P_5 ≤ 47 and we solve the same equation as above, taking the first parenthesis as zero:

P_5 = 27 + (47 − P_5) + (54 − P_5) + (61 − P_5)  =>  P_5 = 48

which does not satisfy the condition. Again we have to examine the next case, which is 47 < P_5 ≤ 54, and we solve the same equation as above, taking the first two parentheses as zero:

P_5 = 27 + (54 − P_5) + (61 − P_5)  =>  P_5 = 48

which this time satisfies the condition. We proceed with the sixth hyperplane and check whether P_5 processors suffice:

E_6 = |D_6^0| + (|D_1^0| + ... + |D_1^5| − P_5) + (|D_2^0| + ... + |D_2^4| − P_5) + (|D_3^0| + ... + |D_3^3| − P_5) + (|D_4^0| + |D_4^1| + |D_4^2| − P_5) + (|D_5^0| + |D_5^1| − P_5) = 40

hence E_6 < P_5. Since we can schedule all the points of this hyperplane with P_5 processors, we do not need to estimate a new value for the number of processors and we consider P_6 = P_5. On the last hyperplane, we check if P_6 processors suffice:

E_7 = |D_7^0| + (|D_1^0| + ... + |D_1^6| − P_6) + (|D_2^0| + ... + |D_2^5| − P_6) + (|D_3^0| + ... + |D_3^4| − P_6) + (|D_4^0| + ... + |D_4^3| − P_6) + (|D_5^0| + |D_5^1| + |D_5^2| − P_6) + (|D_6^0| + |D_6^1| − P_6) = 36

hence E_7 < P_6. Since we can also schedule all the points of this hyperplane with P_6 processors, we consider P_7 = P_6, and the ONP required for the optimal schedule is the last computed P_h, hence LB = P_7 = 48.

38 26 Static scheduling algorithms Implementation and experimental results In order to verify that all UDLs can achieve the proposed dynamic LB = max{lb 1,LB 2,LB 3 } we use the DA described in the previous section. BSDA-Tool was used to generate random UDLs and to determine if an optimal schedule can be found for the candidate number of processors. We ran experiments with random UDLs in 2, 3, 4 and 5 dimensions. In each case, we compare the results obtained with LB = max{lb 1,LB 2 } with the one obtained with LB = max{lb 1,LB 2,LB 3 }. The results are shown in Fig , showing the percentage of UDLs that have ONP = LB, the percentage that have ONP within 10% of the LB, etc. Figure 3.4: The results of scheduling 2D UDLs: left - without the LB 3, right - with the LB 3 Figure 3.5: The results of scheduling 3D UDLs: left - without the LB 3, right - with the LB 3 The results show that all UDLs are scheduled using LB = max{lb 1,LB 2,LB 3 } processors, in comparison with the results before introducing LB 3. The statistical analysis of the experimental results lead us to the conclusion that we can schedule every random UDL using the proposed lower bound. This means we can know apriori the number of processors needed for the optimal schedule and we can start building the schedule using the appropriate lower bound, also reducing the scheduling costs from the hardware point of view. Also, for SGRIDs, the subclass of UDLs identified in the previous section, U B actually coincides with LB 3. Moreover, if the size of the index space in each dimension of an SGRID UDL is divisible with the corresponding coordinate then every index point is crucial and LB 2 = UB.

39 3.3 Adaptive cyclic scheduling 27 Figure 3.6: The results of scheduling 4D UDLs: left - without the LB 3, right - with the LB 3 Figure 3.7: The results of scheduling 5D UDLs: left - without the LB 3, right - with the LB Adaptive cyclic scheduling This section describes another static scheduling algorithm, called adaptive cyclic scheduling (ACS) and published in [CAP05]. ACS exploits the geometric properties of the iterations space in order to arrive at an efficient geometric decomposition of the iterations space. This algorithm is also a centralized scheme with fine grained parallelism approach. Again, the loops considered in this approach have uniform dependence vectors and their loop body consists of simple assignment statements. ACS is a based on [DAK + 02] and [CAK + 03]. In [DAK + 02] a nearly optimal solution to the scheduling problem with communication cost, which attains a makespan of constant delay from the earliest possible is given. In [CAK + 03], the index space of the nested loops is replicated with patterns of independent points, and a new index space is obtained by mapping every such pattern into a iteration point of the new index space; the scheduling is performed along hyperplanes of the new index space, in a round-robin fashion among the available processors. In contrast with the two, ACS accounts for the communication cost, while being suitable for the two most common network architectures: homogeneous and heterogeneous systems. The index space of a UDL is partitioned into regions. J is further divided into cones, according to the existing dependence vectors. Using the QuickHull algorithm [BDH96], from the hyperplane of every cone an optimal scheduling hyperplane is determined, that

best exploits the available parallelism. These partitions are illustrated in Fig. 3.8 for the example below.

Figure 3.8: Geometrical representation of the index space, the regions, the cones, the cone vectors, the communication vector, some chains and the optimal hyperplane of the example.

Example. As an example, consider the 2D index space of a UDL with the following dependence vectors: d_1 = (1,7), d_2 = (2,4), d_3 = (3,2), d_4 = (4,4) and d_5 = (6,1). Except for d_4, all other vectors belong to the cone vector sequence CV = (d_1, d_2, d_3, d_5). The 5 cones thus formed are colored respectively, and the first 3 regions of all cones are shown (see Fig. 3.8). The communication vector in this example is d_c = d_3. Note that only certain dependence vectors contribute to the formation of every hyperplane in each cone. These vectors are called cone vectors.

Partitioning the index space into regions is beneficial for the time scheduling. However, it is not sufficient for an efficient space scheduling, because it does not address the data locality issue. Therefore, an additional decomposition of the index space is necessary to enhance the data locality [MMS99][MSM04]. Usually, there are certain sequences of points computed by the same processor. Each such sequence can be viewed as a chain of computations, created by a certain dependence vector, as shown in Fig. 3.8. This dependence vector is called the communication vector, denoted d_c, and is usually chosen to be the dependence (cone) vector that incurs the largest amount of communication (in most cases it is the vector with the smallest absolute coordinates). The goal of ACS is to eliminate the communication cost incurred by d_c, hence attributing the communication cost of the loop nest to the remaining k dependence vectors d_1,...,d_k (hereafter k = p − 1, i.e., the number of dependence vectors excluding d_c). The communication vector d_c defines the following family of lines in n-dimensional space: j = p + λd_c, where p ∈ N^n and λ ∈ R. This way, every index point belongs to one such line. Therefore, by defining the chain C_r with offset r to be {j ∈ J | j = r + λd_c, for some λ ∈ R}, the index space J is partitioned into a set C of such chains.

|C_r| and |C| are the cardinalities of C_r and C, respectively, and |C_M| is the cardinality of a maximal chain of C. Generally, in an index space there are many maximal chains. The points of C_r communicate via d_i (i designating every dependence vector except d_c, i.e., 1 ≤ i ≤ k) with the points of C_{r_i}. Let D_r^in be the volume of the incoming data for C_r, that is, the number of index points on which the points of C_r depend. Similarly, D_r^out is the volume of the outgoing data for C_r, that is, the number of index points which depend on the points of C_r. The total communication associated with C_r is D_r^in + D_r^out. Hereafter, NP denotes the number of available processors.

3.3.1 Adaptive cyclic scheduling description

The adaptive cyclic scheduling (ACS) algorithm is designed to enhance the data locality of programs containing uniform nested loops using the concept of chains, while taking advantage of the hyperplane scheduling method in [DAK + 02]. When mapping all points of a chain C_r to a single processor, the communication incurred by d_c is completely eliminated. Furthermore, assuming that C_r sends data to C_{r_1},...,C_{r_m} due to dependence vectors d_1,...,d_m, by mapping C_r and C_{r_1},...,C_{r_m} to the same processor the communication incurred by d_1,...,d_m is also eliminated. Similarly, the chains from which C_r receives data, C_{r'_1},...,C_{r'_m}, can also be mapped to the same processor, hence further increasing the overall performance of the parallel program. Usually, it is not possible to map all these chains to the same processor; yet the important thing is the following: when assigning a new chain to the processor that executed C_r, systematically pick one of C_{r_1},...,C_{r_m},C_{r'_1},...,C_{r'_m} instead of choosing in a random or arbitrary manner. This strategy is guaranteed to reduce the communication cost. Depending on the type of the target system's architecture, the chains are assigned to processors as follows: on homogeneous systems in a load-balanced fashion, whereas on heterogeneous systems they are assigned according to the available computational power of every processor. In both cases, the master-slave model is used.

The ACS algorithm
Input: An n-dimensional nested loop, with terminal point U.
Master:
(1) Determine the cone vectors.
(2) Compute the cones.
(3) Use QHull to find the optimal hyperplane.
(4) Choose d_c.
(5) Form and count the chains.
(6) Compute the relative offsets between C_(0,0) and the k dependence vectors.
(7) Divide NP so as to cover most successfully the relative offsets below as well as above d_c. If no dependence vector exists below (or above) d_c, then choose the offset closest to NP.
(8) Assign chains to slaves in a cyclic fashion.
Slave:
(1) Send a request for work to the master.
(2) Wait for the reply; store all chains and sort the points by the region they belong to.
(3) Compute the points region by region, and along the optimal hyperplane. Communicate only when needed points are not locally computed.
Output:
(Slave) When there are no more points in memory, notify the master and terminate.
(Master) If all slaves sent notification, collect the results and terminate.

3.3.2 ACS on homogeneous systems

Consider a homogeneous system of 5 slaves and one master, and the example in Fig. 3.9. The dependence vectors are d_1 = (1,3), d_2 = (2,2), d_3 = (4,1) and the communication vector is d_c = d_2. The points of C_(0,0) send data to the points of C_(0,0)+d_1 = C_(0,2) and C_(0,0)+d_3 = C_(3,0). By mapping all these chains to the same slave, say S_1, one can see that the data locality is greatly enhanced. The relative offsets are r = 3 (below d_2) and r = 2 (above d_2). However, the system has 5 slaves and it must cyclically assign them chains, so as to cover the relative offsets most successfully. In particular, the 5 slaves are divided as follows: slaves S_1, S_2 and S_3 get cyclically assigned chains below d_c, thus eliminating communication along d_3, whereas the other 2 slaves, S_4 and S_5, are cyclically assigned chains above d_c, eliminating communication along d_1 as well. This way all slaves are used in the most efficient way.

Figure 3.9: ACS chain assignment on a homogeneous system with 5 slaves.

3.3.3 ACS on heterogeneous systems

When dealing with heterogeneous systems, one must take into account the available computing power and the actual load of the slave making the request to the master. It is assumed that every process running on the slave takes an equal share of its computing resources. The available virtual power (ACP_i) of a slave i is computed by dividing its virtual computing power (VCP_i) by the number of processes running in its queue (Q_i). The total available computing power of the system (ACP) is the sum of the available computing power of every slave. ACP_i is communicated to the master when slave i makes a request for work. In this case, the master will assign chains to slave i according to the ratio ACP/ACP_i. For instance, consider a heterogeneous system with 5 slaves and the example in Fig. 3.9.

Suppose slave S_3 has less computational power than the other slaves. Therefore it gets assigned only 4 chains, whereas the slaves S_1, S_2, S_4 and S_5 get assigned 5 chains each. Of course, this is an oversimplified example; in a larger heterogeneous system, and for bigger problems, the ACS algorithm would assign chains to slaves accurately reflecting the effect of the different ACP_i of every slave.

The adaptive cyclic scheduling described in this section has some similarities with the static cyclic and static block scheduling methods described in [MA01]. The similarities are: the assignment of iterations to a processor is determined a priori and remains fixed, yielding reduced scheduling overhead; they require explicit synchronization between dependent chains or tiles. However, the three scheduling methods are at the same time different, and the main differences are: iterations within a chain are independent, while within a tile they are not (this promotes the adaptability of ACS to different architectures, as mentioned above); ACS significantly enhances the data locality, hence it performs best when compared to the static cyclic and block methods.

3.4 Chain pattern scheduling

This section describes a similar static scheduling algorithm, which also exploits the regularity of nested loops' iteration spaces, named chain pattern scheduling (CPS) and published in [CAD + 05]. The loops considered in this approach, however, have uniform loop-carried dependencies and their loop body is complex, consisting of general program statements such as assignments, conditions or repetitions. These loops are called general nested loops. CPS is an extension of [DAK + ][PAD01], in which a geometric method of pattern-outline prediction was presented for uniform dependence loops, but no space mapping method was given. In [DAK + ] and [PAD01] no communication cost was taken into account, and the unit-execution time (UET) model was assumed. In contrast, CPS accounts for the communication cost and a trade-off is given between different space-mapping schemes. CPS combines the method in [DAK + ] with the static cyclic scheduling of [MA01], thus improving on [DAK + ] & [MA01] in that it enhances the data locality by utilizing methods from computational geometry to take advantage of the regularity of the index space. In particular, groups of iteration points, called chains, consisting of points connected via a specific dependence vector, called the communication vector d_c (see [PAD01] & [CAP05]), have been identified. Specific chains are mapped to the same processor to enhance the data locality.

It is a well-known fact that the communication overhead in most cases determines the quality and efficiency of the parallel code. The fundamental idea of CPS is that regardless of the underlying interconnection network (FastEthernet, GigabitEthernet, SCI, Myrinet), of the number of processors within a node (for SMP systems), or of the system's homogeneity or heterogeneity, reducing the communication cost always yields enhanced performance. Given two scheduling policies that use the same number of processors, the one requiring less data exchange between the processors will almost certainly outperform the other. Due to the existence of dependence vectors, only a certain set of iteration points of J can be executed at every moment [PAD01]. The geometric border of this set forms a polygonal shape called pattern.
CPS executes every index point at its earliest computation time (ECT), imposed by the existing dependence vectors. This policy guarantees the optimal execution time of the entire loop (assuming an unbounded number of processors). The index space is partitioned into disjoint time subsets denoted Pat_s, s ≥ 0, such that Pat_s contains the set of points of J whose earliest computation time is s. Fig. 3.10 illustrates the patterns, the pattern outline, the pattern points and the pattern vectors for a general UDL with the four dependence vectors d_1 = (1,3), d_2 = (2,2), d_3 = (4,1), and d_4 = (4,3).

Figure 3.10: The patterns, pattern vectors and chains of a general UDL with 4 dependence vectors.

Each shaded polygon depicts a pattern that can be executed at a certain moment. The first 3 patterns are shown. The dashed lines mark the border of each pattern, called the pattern outline. Except for d_4, all other vectors contribute to the formation of the pattern outline and are therefore called pattern vectors. On the axes lie the pre-computed boundary points. The communication vector of the loop in Fig. 3.10 is d_c = d_2. The chain with origin (0,0), denoted C_(0,0), is shown, as well as the chains it sends data to due to d_1, d_4 and d_3, i.e., C_(0,2), C_(1,0) and C_(3,0), respectively.

As with ACS, the goal of CPS is to eliminate the communication cost incurred by d_c, attributing the communication cost to the remaining k dependence vectors d_1,...,d_k, k = p − 1. The communication vector d_c defines the following family of lines in n-dimensional space: j = p + λd_c, where p ∈ N^n and λ ∈ R. This way, every index point belongs to one such line. Thus, by defining the chain C_r with offset r to be {j ∈ J | j = r + λd_c, for some λ ∈ R}, we partition the index space J into a set C of such chains. The offset r is chosen so as to have at least one of its coordinates equal to 0; in other words, r is a pre-computed point. |C_r| and |C| are the cardinalities of C_r and C, respectively, and |C_M| is the cardinality of a maximal chain of C (generally, in an index space there are many maximal chains). The points of C_r communicate via d_i (i designating any dependence vector except d_c) with the points of C_{r_i}. Let D_r^in be the volume of the incoming data for C_r, i.e., the number of index points on which the points of C_r depend. Similarly, D_r^out is the volume of the outgoing data for C_r, i.e., the number of index points which depend on the points of C_r. The total communication associated with C_r is D_r^in + D_r^out. In the rest, NP denotes the number of available processors.
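Since the offset of a chain is obtained by sliding an index point back along d_c until a coordinate reaches the loop's lower bound, chain membership can be computed locally for every point. The following is a minimal sketch for the 2D case, assuming a lower bound of 0 on both axes; the struct and function names are illustrative, not part of the published algorithm.

    /* Return the offset r of the chain containing point j = (x, y), for a 2D
       index space with lower bound (0, 0) and communication vector dc.
       r = j - lambda*dc, where lambda is the largest integer that keeps both
       coordinates non-negative.                                              */
    typedef struct { int x, y; } point2d;

    static point2d chain_offset(point2d j, point2d dc)
    {
        int lambda_x = (dc.x > 0) ? j.x / dc.x : (1 << 30);  /* no bound if dc.x == 0 */
        int lambda_y = (dc.y > 0) ? j.y / dc.y : (1 << 30);
        int lambda   = (lambda_x < lambda_y) ? lambda_x : lambda_y;

        point2d r = { j.x - lambda * dc.x, j.y - lambda * dc.y };
        return r;
    }

For the loop of Fig. 3.10 (d_c = (2,2)), chain_offset applied to the point (4,6) yields (0,2), i.e., the point belongs to C_(0,2); mapping all points with the same offset to the same processor is exactly what removes the communication along d_c.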

45 3.4.1 Chain pattern scheduling Chain pattern scheduling The chain pattern scheduling algorithm is devised to enhance the data locality of programs with general nested loops using the concept of chains, and taking advantage of the optimality of the scheduling method based on patterns. By mapping all points of a chain C r to a single processor, communication incurred by d c for those points is completely eliminated. Furthermore, assuming that C r sends data to C r1,...,c rk due to dependence vectors d 1,...,d k, by mapping C r and C r1,...,c rk to the same processor the communication incurred by d 1,...,d k is also eliminated. Similarly, the chains from which C r receives data, C r 1,...,C r k, can also be mapped to the same processor, thus eliminating even more communication. Generally, it will not be possible to map all these chains to the same processor; yet the important thing is the following: when assigning a new chain to the processor that executed C r, systematically pick one of C r1,...,c rk,c r 1,...,C r k instead of choosing in a random or arbitrary manner. This strategy is guaranteed to lead to significant communication reduction. To better illustrate our point we consider two different scenarios: (1) unbounded N P high communication moderate performance, and (2) fixed N P moderate communication good performance. A common feature of both scenarios is that chains are mapped to processors starting with chain C (0,0) and proceeding with the chains to its left (or above) and right (or below), like a fan spreading out. It is straightforward that by doing this more points become available for execution (are released from their dependencies) than in any other way. Scenario 1: Unbounded NP High Communication This is the case where there are enough available processors so that each chain is mapped to a different processor (see Fig. 3.11). Two similar examples are given: (a) A loop with three dependence vectors, d 1 = (1,3), d 2 = d c = (2,2) and d 3 = (4,1) with chains formed along d 2. (b) A loop with two dependence vectors, d 1 = (4,1) and d 2 = d c = (2,2), with chains also formed along d 2. Note that in both cases, 24 chains are created. However this scenario is somewhat unrealistic because for larger index spaces the number of chains would be much greater and, therefore, the number of processors required for assigning one chain to one processor would be prohibitive. On the other hand, this scenario does not support any kind of data locality (except for d 2 ), requiring a large volume of communication between the processors. Basically, a chain C r must receive data from k different chains and send data to k different chains. This implies that both Dr in and Dr out are bounded above by k C M (recall that C M is the cardinality of a maximal chain). The total volume of communication induced by this scheme, denoted V, is then: V 2k C M C. It is obvious that such a high volume of communication diminishes the overall performance of the parallel program. Scenario 2: Fixed NP Moderate Communication This scenario is designed to minimize the communication and enhance the data locality, thus increasing the overall performance of the parallel program. For this scenario, two alternative mappings are considered, as shown in Fig and In order to have moderate communication, and thus better performance than in scenario (1), an arbitrary NP was considered (herein 5 processors). The first mapping (see Fig. 
3.12(a)(b)) is an implementation of the well known cyclic mapping [MA01], where each chain from the pool of unassigned chains is

mapped in a cyclic fashion, starting with C_(0,0). This means that the same processor will execute chains C_(0,0), C_(5,0), C_(10,0), and so on. The communication volume for this mapping depends on NP and on the chain-offsets r_1,...,r_k corresponding to dependence vectors d_1,...,d_k. In particular, if NP is greater than the maximum coordinate appearing in one of the offsets r_1,...,r_k, then the volume of communication is prohibitive, basically being the same as the one in scenario (1), i.e., V ≤ 2k|C_M||C|. Otherwise, if NP is equal to a coordinate of an offset r_i, the communication incurred by the corresponding dependence vector d_i is eliminated. Hence, let q be the number of offsets that have one of their coordinates equal to NP; then the volume of communication is V ≤ 2(k − q)|C_M||C|.

Figure 3.11: Scenario (1): unbounded NP, high communication, moderate performance; every chain is mapped to a different processor (an unrealistic assumption with respect to the number of available processors).

The second mapping differs from the previous one in that it intentionally zeroes the communication cost imposed by as many dependence vectors as possible. In particular, in Fig. 3.13(a) the first three processors execute the chains below d_c in a round-robin fashion, whereas the other two processors execute the chains above d_c, again in a round-robin fashion. This way, the communication cost attributed to d_3 is eliminated for the chains C_(0,0) to C_(13,0), and the communication cost attributed to d_1 is eliminated for the chains C_(0,1) to C_(0,10). The difference from Fig. 3.12(a) is that in this case the processors do not span the entire index space, but only a part of it (i.e., below or above d_c). The benefits of doing so are that a more realistic NP is assumed while still having acceptable communication needs, and performance is increased as a result. In Fig. 3.13(b), due to the fact that there is no dependence vector above d_c, a cyclic mapping of chains is possible, starting with the chain C_(0,0) and moving above and below, so as to incrementally release points for execution. This is similar to Fig. 3.12(b), with the difference that 5 processors were used instead of 3. The advantage of this scenario is that it does not limit the degree of parallelism, because it uses all the available processors, hence being the most realistic of the two scenarios. To estimate the volume of communication in this case we reason as follows.

Consider NP as the sum of q_1,...,q_l (i.e., NP = q_1 + ... + q_l and, ideally, l = k) such that each q_i is a coordinate of offset r_i. By proper assignment of groups of q_i chains to processors, the communication cost incurred by d_i is greatly diminished throughout the index space. Even if such a decomposition of NP into groups of q_i processors is not possible, it is always possible to choose appropriate integers q_1,...,q_l such that each q_i is a coordinate of r_i and NP > q_1 + ... + q_l. Thus, by letting q = q_1 + ... + q_l, we conclude that the volume of communication is V ≤ 2(k − q)|C_M||C|.

Figure 3.12: Scenario (2): fixed NP, moderate communication (cyclic mapping); chains are mapped to the available processors in a cyclic fashion starting with chain C_(0,0).

For the sake of simplicity, in both scenarios every chain was assigned to a single processor. This is best suited for distributed-memory systems, which usually consist of single-processor nodes. Note, however, that each chain may be assigned to more than one processor such that no communication among them is required. This is more advantageous for symmetric multiprocessor systems (SMPs), where the processors of a node can communicate through the local (intra-node) shared memory. Our approach is also suitable for heterogeneous networks, in which case a processor with higher computational power would be given either longer or more chains, whereas one with lesser computational power would be given either fewer or shorter chains. It is obvious that the ratio of communication time to processing time is critical for deciding which scenario and which architecture best suits a specific application.

3.4.2 Implementation and performance results

A program written in C++, which emulates the distributed-memory systems model, was used to validate the proposed methodology. This program implements both the cyclic scheduling method [MA01] and the chain pattern scheduling. We tested index spaces ranging from ... to ... index points. For all index spaces, the four dependence vectors of the loop nest given in Fig. 3.10 and the communication vector (2,2) were considered. Figures 3.14 to 3.17 give the simulation results when NP ranges from 5 to 8 processors. Note that in every case, and for all index space sizes, the chain pattern mapping performs better than the classic cyclic mapping.

Figure 3.13: Scenario (2): fixed NP, moderate communication (mapping with offset 3 along the x axis and offset 2 along the y axis); chains are mapped to the available processors so as to minimize the communication imposed by d_c, d_1 and d_3.

In particular, the communication reduction achieved with the chain pattern mapping ranges from 15% to 35%.

Figure 3.14: Experimental results, CPS vs cyclic, on NP = 5

49 3.4.2 Implementation and performance results 37 Figure 3.15: Experimental results, CPS vs cyclic, on NP = 6 Figure 3.16: Experimental results, CPS vs cyclic, on NP = 7 Figure 3.17: Experimental results, CPS vs cyclic, on NP = 8
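The C++ emulator mentioned above essentially counts, for every dependence edge, whether its two endpoints fall on different processors under a given chain-to-processor mapping. A minimal sketch of such a counter for a 2D index space is given below; the 0-based bounds, the cyclic chain-to-processor assignment (chain identifier modulo NP) and the helper names are illustrative assumptions, not the original emulator code.

    #include <stdio.h>

    #define NX 100
    #define NY 100
    #define K  4                       /* number of dependence vectors */
    #define NP 5                       /* number of processors         */

    static const int dep[K][2] = { {1,3}, {2,2}, {4,1}, {4,3} };  /* d2 = dc */
    static const int dc[2]     = { 2, 2 };

    /* Chain identifier of point (x,y): slide back along dc until a coordinate
       would become negative, then linearize the resulting chain origin.      */
    static int chain_id(int x, int y)
    {
        int lx = x / dc[0], ly = y / dc[1];
        int l  = (lx < ly) ? lx : ly;
        int rx = x - l * dc[0], ry = y - l * dc[1];
        return rx * (NY + 1) + ry;
    }

    int main(void)
    {
        long cross = 0;
        for (int x = 0; x < NX; x++)
            for (int y = 0; y < NY; y++) {
                int owner = chain_id(x, y) % NP;       /* cyclic chain mapping */
                for (int k = 0; k < K; k++) {
                    int px = x - dep[k][0], py = y - dep[k][1];
                    if (px >= 0 && py >= 0 && chain_id(px, py) % NP != owner)
                        cross++;                       /* datum must be sent   */
                }
            }
        printf("cross-processor messages: %ld\n", cross);
        return 0;
    }

Swapping the modulo assignment for the fan mapping of Fig. 3.13 (chains below d_c round-robin over the first three processors, chains above d_c over the remaining two) yields the kind of comparison plotted in Figures 3.14 to 3.17.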


51 CHAPTER4 Dynamic scheduling algorithms In dynamic scheduling, only a few assumptions about the parallel program or the parallel system can be made before execution, and thus, scheduling decisions have to be made on-thefly [AG91, PLR + 95]. The goal of a dynamic scheduling algorithm as such includes not only the minimization of the program completion time but also the minimization of the scheduling overhead which constitutes a significant portion of the cost paid for running the scheduler. Dynamic scheduling is necessary in situations where static scheduling may result in a highly imbalanced distribution of work among processes or where the dependencies between the tasks are dynamic, thus precluding a static scheduling approach. Since the primary reason for using a dynamic scheduling is balancing the workload among processes, dynamic scheduling is often referred to as dynamic load balancing. Dynamic scheduling techniques are usually classified as either centralized or distributed. In a centralized dynamic load balancing scheme, all executable tasks are maintained in a common pool or they are maintained by a special processor or a subset of processors. If a special processor is designated to manage the pool of available tasks, then it is often referred to as the master and the other processors that depend on the master to obtain work are referred to as slaves. Whenever a processor has no work, it takes a portion of available work from the central data structure or the master process. Whenever a new task is generated, it is added to this centralized pool and/or reported to the master processor. Centralized load balancing schemes are usually easier to implement than distributed schemes, but may have limited scalability. As more and more processors are used, the large number of accesses to the pool of tasks or the master processor tends to become a bottleneck. The ultimate in automatic load balancing is a self-scheduling system that tries to keep all processing resources running at maximum efficiency. There may be a central location or authority (usually called master) to which processors refer for work and where they return their results. An idle processor requests that it be assigned new work by sending a message to the master and in return receives one or more tasks to perform. This works nicely for tasks with small contexts and/or relatively long running times. If the master is consulted too often, or if it has to send (receive) large volumes of data to (from) processors, it can become a bottleneck. In a distributed dynamic load balancing scheme, the set of executable tasks are distributed among processors which exchange tasks at run time to balance work. Each process can send work to or receive work from any other process. These methods do not suffer from the bottleneck associated with the centralized schemes. However, they are more difficult to implement.
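The centralized, master-slave organization just described maps naturally onto a small message-passing skeleton. The sketch below uses the standard MPI point-to-point calls employed elsewhere in this thesis to show an idle slave requesting work and the master handing out task indices until the central pool is exhausted; the tag values, the stop encoding (-1) and the task granularity are illustrative assumptions rather than part of any of the schemes presented later.

    #include <mpi.h>
    #include <stdio.h>

    #define TAG_REQUEST 1
    #define TAG_WORK    2
    #define TOTAL_TASKS 1000           /* size of the central pool (illustrative) */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                             /* master: owns the task pool */
            int next = 0, stopped = 0, dummy;
            MPI_Status st;
            while (stopped < size - 1) {
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                         MPI_COMM_WORLD, &st);
                int task = (next < TOTAL_TASKS) ? next++ : -1;   /* -1 = stop */
                if (task < 0) stopped++;
                MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
            }
        } else {                                     /* slave: self-scheduling loop */
            int task, dummy = 0;
            for (;;) {
                MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
                MPI_Recv(&task, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (task < 0) break;                 /* pool exhausted */
                /* ... execute the iterations of this task here ... */
            }
        }
        MPI_Finalize();
        return 0;
    }

The potential bottleneck discussed above is visible in this skeleton: every unit of work costs the master one receive and one send, which is why the distributed schemes presented later avoid funnelling all requests through a single process.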

In this chapter, two dynamic load-balancing distributed schemes are presented along with two dynamic load-balancing centralized schemes. The notations used in this chapter are given below:

- UDL: uniform dependence loop, a loop nest with uniform precedence constraints.
- GL: general UDL, those nested loops for which the loop body consists of generic program statements (such as assignments, conditions and repetitions) and which exhibit uniform loop-carried dependencies. The execution and communication times are arbitrary.
- GRIDs: a special category of UDLs with unitary dependence vectors that reside on the axes.
- PE: processing element.
- P_1,...,P_m: the slaves.
- J_aux: auxiliary index space, obtained from the normal index space J by mapping groups of iterations from J to a point in J_aux.
- SpecPat: special pattern, containing p neighboring independent iteration points of J.
- L_aux: initial point of the auxiliary index space.
- U_aux: terminal point of the auxiliary index space.
- The convex hull formed from the index points j_1,...,j_m is defined as: CH = {j ∈ N^n | j = λ_1 j_1 + ... + λ_m j_m, where λ_1,...,λ_m ≥ 0 and λ_1 + ... + λ_m = 1}.
- N: the number of scheduling steps, i = 1,...,N.
- C_i: a few consecutive iterations of the loop are called a chunk; C_i is the chunk size at the i-th scheduling step.
- V_i: the size (in number of iterations) of chunk i along dimension u_n.
- SP: in each chunk we introduce M synchronization points (SPs) uniformly distributed along u_1.
- H: the interval (number of iterations along dimension u_1) between two SPs (H is the same for every chunk).
- The current slave is the slave assigned chunk i, whereas the previous slave is the slave assigned chunk i − 1.
- VP_k: the virtual computing power of slave P_k.
- VP = Σ_{k=1}^m VP_k: the total virtual computing power of the cluster.
- Q_k: the number of processes in the run-queue of P_k, reflecting the total load of P_k.
- ACP: A_k = VP_k / Q_k is the available computing power of P_k (needed when the loop is executed in non-dedicated mode).

53 4.1 Scaled GRIDs and parallel code generation 41 A = m k=1 A k is the total available computing power of the cluster. SC i,j is the set of iterations of chunk i, between SP j 1 and SP j. 4.1 Scaled GRIDs and parallel code generation This section describes a method for transforming sequential perfectly nested loops into their equivalent parallel form, published in [CAK + 03]. This work is based on parallelizing the special category of uniform dependence loops (UDLs), called Scaled GRIDs (SGRIDs). Every UDL in this category has the general structure of a GRID, i.e., all dependence vectors reside on the axes and the value of the non-zero coordinate is an arbitrary positive integer. GRIDs are a particular case of SGRIDs, where the dependence vectors are the unitary vectors along the axes. The problem addressed herein can be defined as follows: given a sequential program with a loop of SGRID UDLs structure, with a simple loop body, assuming that a fixed number of processors is available, produce its asymptotically optimal equivalent parallel program. The method described here always produces the equivalent parallel code in polynomial time. If the targeted architecture is a shared memory system, the produced parallel code is the optimal time scheduling for the available number of processors. However, if the targeted architecture is a distributed memory system, the produced parallel code is an efficient scheduling for the specified number of processors, and the experimental results are very close to the ideal speedup Scaled GRIDs Definition The Scaled GRIDs are a subclass of the general class of UDLs in which the set DS is partitioned into two disjoint subsets: AXV and DS AXV. The AXV (Axes Vectors) subset contains all the dependence vectors that lie along the axes, i.e., each dependence vector has the form d = λ e i, where e i is the unitary vector along dimension i and λ 1 is a positive integer. Further, we require that for every i, 1 i n, there exists at least one d = λ e i. Definition The special pattern (SpecPat) is a collection of iteration points that are non-dependent on each other throughout the execution of the whole nest, obtained by considering the vectors with the smaller values from the AXV set, such that all other dependence vectors of DS AXV do not create any dependency between iteration points within the pattern. The well known GRIDs are a special case of SGRIDs in which λ = 1 for all the dependence vectors, and any resulting SpecPat contains only one iteration point, due to the unitary dependence vectors that characterize GRIDs. Considering the number of completely independent points within the pattern is p, by replicating the SpecPat throughout the whole iteration space, collections of p independent points are obtained, that obviously can be concurrently executed, provided there are available hardware resources. Therefore, the loop nest can be scheduled according to the available number of processors, that should preferably be a multiple of p. In case the number of processors at hand is not a multiple of p then at some moment, several of the processors will stay idle because of no available iteration points eligible for execution. If the number of

available resources is less than p (which is rarely the case), the results produced by our tool are not satisfactory. Therefore, hereafter we assume a number of processors that is a multiple of p is available.

Example. Consider the 2D index space of an SGRID, representing the following two-level loop nest:

for (i=1; i<=10; i++)
  for (j=1; j<=10; j++) {
    A[i,j] = (A[i,j-2]*A[i-2,j])^20;
    B[i,j] = 2*B[i,j-3]^10 + B[i-4,j]^10 - 1;
  }

For this example, the initial point of the index space is L = (1,1), the terminal point is U = (10,10) and the cardinality of J is |J| = 100 points. The dependence vectors are d_1 = (0,2), d_2 = (2,0), d_3 = (0,3) and d_4 = (4,0). Due to the size of the smallest vectors on each axis, the SpecPat is a collection of 4 independent iteration points. The iteration index space with the replicated pattern is shown in Figure 4.1(a), while the auxiliary space is depicted in Figure 4.1(b). Notice that the initial point of the auxiliary space is L_aux = (0,0) and contains the iteration points (1,1) (the initial iteration point), (2,1), (1,2) and (2,2). The terminal point of the auxiliary space, U_aux = (4,4), contains the iteration points (9,9), (9,10), (10,9) and (10,10) (the terminal iteration point). The cardinality of the auxiliary space is |J_aux| = 25 patterns.

Figure 4.1: Geometrical representation of the (a) iteration index space J and (b) auxiliary index space J_aux of the example.
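The mapping from an iteration point to its pattern in the auxiliary space is a simple coordinate division by the pattern size along each axis. A short sketch for the example above follows; the 2 x 2 pattern size comes from the smallest axis vectors (2,0) and (0,2), and the struct and function names are illustrative assumptions.

    /* Map an iteration point (i, j) of the example (1 <= i, j <= 10) to its
       pattern coordinates in the auxiliary space J_aux.  The pattern is 2 x 2
       because the smallest axis vectors are (2,0) and (0,2).                 */
    typedef struct { int a, b; } aux_point;

    static aux_point to_aux(int i, int j)
    {
        aux_point p = { (i - 1) / 2, (j - 1) / 2 };
        return p;
    }

For instance, to_aux(1,1) gives (0,0) = L_aux and to_aux(10,9) gives (4,4) = U_aux, in agreement with the pattern contents listed above; the patterns lying on the same anti-diagonal a + b of J_aux belong to the same hyperplane, as suggested by Fig. 4.1(b).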

55 4.1.2 Dynamic scheduling policy 43 The iteration index space J is partitioned into hyperplanes (or hypersurfaces) using the concept of earliest computation time [AKC + 03b, PAD01], that is, all the points that can be executed the earliest at moment k are said to belong to hyperplane k. As it can be seen in Figure 1, all the points that belong to hyperplane 3 in J, belong to the same hyperplane 3 in the auxiliary space J aux too. Of course, this holds for every hyperplane of J Dynamic scheduling policy The dynamic scheduling policy is described here. It is a distributed coarse-grain parallelism approach scheme. The tool schedules concurrently all the eligible iteration points gathered in a pattern, because there is no need to exchange data between iteration points of the same pattern. Any iteration space replicated with the special pattern is traversed pattern by pattern according to the lexicographic ordering of the resulting coordinates if each pattern were taken as a cartesian point in the auxiliary space. Therefore, the patterns execution order is the lexicographically traversal of the points on every hyperplane (zig-zag ordering) of the auxiliary space, as shown in Figure 1(b). In other words, all the patterns in the iteration space J are executed in a lexicographic order. At every moment, all the executable points are computed on the existing hardware resources, in groups of p processors. The processors within each such group work concurrently, due to being assigned independent iteration points. It is obvious though, that two distinct groups of processors may exchange data, if there is any dependency between the iteration points of one pattern and iteration points of the other. Consider the available resources are x processors. They are arranged in groups of p processors. Initially, the first group executes the first pattern (i.e., L aux ), the second executes the second pattern, and so on, until the last group executes the x/p-th pattern. Once the initial pattern assignment is done, every subsequent pattern of each group is determined by skipping x/p positions from the currently assigned pattern. In other words, the first group gets the L aux + x/p-th pattern, the second group gets the L aux + x/p + 1-th pattern, and so on, until the last group gets the L aux +2 x/p-th pattern. One can notice that, when groups are assigned patterns belonging to the same hyperplane, they can all compute concurrently. The scheduling proceeds this way, until reaching and executing the pattern containing the terminal point U of J. Consider the loop nest of Example 4.1.1, written in a high level programming language (such as C). Since loops have a simple loop body, a conventional parser can be used to extract the necessary information about the program. In this case, the depth of the loop nest is n = 2, the size of the iteration index space is J = 100, and the k = 4 dependence vectors are d 1 = (0,2), d 2 = (2,0), d 3 = (0,3) and d 4 = (4,0). Following the dependence analysis, the number of independent iteration points that form a pattern is determined to be p = 4. Before the scheduling begins as described previously, the tool requires as input from the user the number of available hardware resources x on which the sequential code is to be parallelized. Supposing x = 12 processors are available, they are arranged in three groups of 4 processors each. 
According to the scheduling policy, the first such group (G 1 ) gets assigned the pattern of the first hyperplane (hyperplane 0), the second (G 2 ) gets assigned the next lexicographic pattern (hyperplane 1) and the third (G 3 ) gets the next lexicographic pattern (also on hyperplane 1). After the initial assignment of patterns to each group, the next pattern for each is the one found three positions further in the lexicographic ordering. In other words, the second pattern for G 1 is found by skipping three patterns from the initial pattern position (0,0) in J aux, i.e. the first pattern on hyperplane 2, (0,2). The second pattern for G 2 is the pattern at (1,1), three position further from the

56 44 Dynamic scheduling algorithms initial pattern (0,1), and G 3 gets (2,0), three position further from the initial pattern (1,0). The skipping method of traversing the index space is used in order to avoid overlapping the execution of some pattern(s) (or iteration points) and constitutes the main benefit of using multiple groups of processors. The loop body of the sequential program is taken together with all assignment statements it contains and is passed on to each processor that will execute it for the appropriate iteration index point. It is worth mentioning that the only way to distinguish processors in MPI is referencing them by their rank. A rank is a positive integer assigned to each processor of the parallel system that designates the identification number of the processors throughout the parallel execution. Therefore, in order to know from which processors to receive necessary data and/or to send computed data to, the available processors are assigned ranks as follows: processors from first group get ranks between 0 to 3, processors from second group get ranks between 4 to 7 and the ones of third group get ranks between 8 and 11. The parallel code is generated upon completion of above phases and is described next Parallel code generation This section gives the parallel code the tool produced for the Example The generated pseudocode for the equivalent parallel version of the program is the following: forall (available processors) /* get coordinates of initial pattern and processor rank */ current_pattern = get_initial_pattern(rank); while (current_pattern!=out_of_auxiliary_space) { /* we know the current pattern and the rank, find the */ /* iteration point to be executed by each processor with rank */ current_iteration_position=find_position(current_pattern, rank); /* receive data from all points the current point depends on */ MPI_Recv(from_all_dependencies); /* execute the loop body associated to the current iteration */ execute(loop_body, current_iteration_position); /* send data to all points that depend on current point */ MPI_Send(to_all_dependencies); /* move on to the next pattern */ current_pattern=skip_3_patterns(current_pattern); } endforall where the function get_initial_pattern(rank) is the following: get_initial_pattern(rank){ if (0 <= rank <=3) return L_aux else if (4 <= rank <=7) return Skip_1_pattern(L_aux) else if (8 <= rank <=11) return Skip_2_patterns(L_aux) } Herein L_aux represents the initial point of the auxiliary space, i.e., L aux. This function returns the initial pattern coordinates for every of the three groups of 4 processors using the

The generated file containing MPI is ready to be executed on the available processors.

4.1.4 Implementation and experimental results

The MPI code the tool produced uses point-to-point communication and standard blocking send and receive operations, meaning that a send or receive operation blocks the current process until the resources used for the operation can be reutilized. The experiments were run on a cluster with 16 identical 500MHz Pentium nodes. Each node has 256MB of RAM and a 10GB hard drive, and runs Linux. MPI (MPICH) was used to run the experiments over the FastEthernet interconnection network.

The performance of the tool was evaluated by experimenting with randomly generated UDLs. The speedup over the sequential program was measured. The tool can schedule so as to achieve speedups that scale with the number of available hardware resources. As the experimental results (Figure 4.2) validated, the introduced methodology obtained near optimal results for seven reasonably sized randomly generated applications, given below. The tool adapts well to shared memory systems by simply replacing the MPI communication calls with read and write calls from and to the shared memory of the multicomputer. The testing examples are the following:

Table 4.1: The SGRID examples used for testing.

    Test #   n   Size of index space |J|   Dependencies
    1        2   ... points                d1 = (0,2), d2 = (1,0) and d3 = (1,2)
    2        2   ... points                d1 = (0,1), d2 = (1,0) and d3 = (1,1)
    3        2   ... points                d1 = (0,4), d2 = (2,0) and d3 = (4,2)
    4        3   ... points                d1 = (0,0,2), d2 = (0,2,0), d3 = (2,0,0) and d4 = (2,2,2)
    5        3   ... points                d1 = (0,0,2), d2 = (0,2,0), d3 = (1,0,0) and d4 = (1,2,2)
    6        3   ... points                d1 = (0,0,2), d2 = (0,1,0), d3 = (1,0,0) and d4 = (1,1,2)
    7        3   ... points                d1 = (0,0,1), d2 = (0,1,0), d3 = (1,0,0) and d4 = (1,1,1)

The experimental results proved to be very close to the ideal speedup. However, by using standard blocking send/receive communication primitives, the tool introduces a communication overhead which can slightly reduce the speedup when the number of utilized processors exceeds an optimal threshold, i.e., when the communication-to-computation ratio becomes higher than one. On the other hand, the choice of blocking communication primitives brings the advantage of accuracy regarding the data exchanged between processors at execution, and the experimental results proved to be near optimal even in this case.

4.2 Successive dynamic scheduling and parallel code generation for general loops

A new distributed, fine-grain dynamic scheduling scheme, called successive dynamic scheduling (SDS) and published in [ACT+04], is described in this section. Also, a parallel code generation tool, called Cronus/1, is introduced.

Figure 4.2: Speedup results for 7 testing examples.

The problem addressed by SDS and Cronus/1 is defined as follows: given a sequential general UDL (GL), produce the equivalent high performance parallel program using a bounded (but arbitrary) number of processors. The approach to the above problem consists of three steps:

Step 1: Determine the optimal family of hyperplanes (1) using the well known QuickHull algorithm [BDH96].
Step 2: Define the lexicographic ordering for these hyperplanes, and schedule the iterations on-the-fly using SDS.
Step 3: Produce the equivalent parallel code, for any given number of processors.

Step 1 is performed at compile time, whereas steps 2 and 3 are performed at run time. The reason for performing step 2 at run time is the ability to exploit the uniformity of the index space, in order to find an efficient adaptive rule. Also, when dealing with large index spaces, performing step 2 at compile time would be very tedious. In our approach, the overhead associated with step 2 is greatly reduced by performing it at run time. The run time scheduling (SDS) introduced here produces a very efficient time schedule while meeting the precedence constraints. Also, it scales with the available number of processors, in that for any given number of processors it produces an efficient time schedule.

Cronus/1 produces the parallel code using SDS routines and MPI primitives for the dynamic scheduling and execution of the sequential program. Cronus/1 takes as input a sequential program, which is parsed (manually by the user), and the essential parameters for code generation are extracted: the depth of the loop nest (n), the size of the index space (|J|) and the set of dependence vectors (DS), if they exist. Once these crucial parameters are available, the tool calls the QuickHull algorithm (as an external program), which returns the optimal hyperplane for the specific problem. At this phase, the available number of processors (NP) is required as input. With the help of a code generator, the appropriate parallel code is generated for the given number of processors. This parallel code contains run time routines for SDS and MPI primitives for data communication (if communication is necessary at all); it is eligible for compilation and execution on the parallel system at hand.

4.2.1 Finding the optimal scheduling hyperplane using convex hulls

The hyperplane (or wavefront) method [Lam74] was one of the first methods for parallelizing uniform dependence loops. Consequently, it formed the basis of many heuristic algorithms developed in recent years. The choice of the optimal hyperplane for a given index space was one of the challenges that many researchers tackled using diophantine equations, linear programming in subspaces and integer programming. In [PAD01] and [DAK+02] the problem of finding the optimal hyperplane for uniform dependence loops was reduced to the problem of computing the convex hull of the dependence vectors and the terminal point. For instance, given the dependence vectors d1 = (1,8), d2 = (2,5), d3 = (3,3), d4 = (6,2) and d5 = (8,1), consider two index spaces: in the first the terminal point is U1 = (75,90) (see Figure 4.3(a)), and in the second U2 = (105,90) (see Figure 4.3(b)). As shown in Figure 4.3, the convex hulls are the index points contained by the polygons A,B,C,E,U1 and A,B,C,E,U2, respectively.
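Although the tool relies on QuickHull, the hull of such a small 2D point set can be illustrated with any convex hull routine; the sketch below (our own code, not the thesis implementation) uses Andrew's monotone chain on the dependence vectors and the terminal point U1:

    /* Sketch: convex hull of the dependence vectors d1..d5 and U1 = (75,90),
     * computed with Andrew's monotone chain instead of QuickHull. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { long x, y; } pt;

    static int cmp(const void *a, const void *b) {
        const pt *p = a, *q = b;
        if (p->x != q->x) return (p->x < q->x) ? -1 : 1;
        return (p->y < q->y) ? -1 : (p->y > q->y);
    }

    /* cross product of (o->a) x (o->b); > 0 means a left (CCW) turn */
    static long cross(pt o, pt a, pt b) {
        return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
    }

    int main(void) {
        pt p[] = { {1,8}, {2,5}, {3,3}, {6,2}, {8,1}, {75,90} };   /* d1..d5, U1 */
        int n = 6, k = 0;
        pt hull[2 * 6 + 1];

        qsort(p, n, sizeof(pt), cmp);
        for (int i = 0; i < n; i++) {                 /* build lower hull */
            while (k >= 2 && cross(hull[k-2], hull[k-1], p[i]) <= 0) k--;
            hull[k++] = p[i];
        }
        for (int i = n - 2, lo = k + 1; i >= 0; i--) {/* build upper hull */
            while (k >= lo && cross(hull[k-2], hull[k-1], p[i]) <= 0) k--;
            hull[k++] = p[i];
        }
        for (int i = 0; i < k - 1; i++)               /* last vertex repeats first */
            printf("hull vertex: (%ld,%ld)\n", hull[i].x, hull[i].y);
        return 0;
    }

For this input the program prints the vertices (1,8), (2,5), (3,3), (8,1) and (75,90), i.e., the polygon A,B,C,E,U1 of Figure 4.3(a), with D = (6,2) correctly left in the interior.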
The optimal hyperplane in the first case is defined by the dependence vectors (d2, d3), whereas in the second case it is defined by (d3, d5). Hence, the equation of the optimal family of hyperplanes in the first case is 2x1 + x2 = k, k in N, and in the second case 2x1 + 5x2 = k, k in N.

(1) A hyperplane is optimal if no other hyperplane leads to a smaller makespan.

Figure 4.3: Optimal hyperplane for two different index spaces, same dependence vectors, different terminal points.

Scheduling the loop nest of Figure 4.3(a) along the hyperplane 2x1 + x2 = k leads to a parallel time of ⌈240/9⌉ + 1 = 28 time steps, assuming the initial point of the index space is L1 = (0,0). The algorithm used to compute the convex hulls is the well known QuickHull algorithm [BDH96].
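The numbers quoted above can be reproduced as follows (a short worked check; reading 240 and 9 as the values of the hyperplane expression at U1 and at the defining dependence vectors is our interpretation of the text):

    \begin{align*}
      2\cdot 2 + 5 &= 2\cdot 3 + 3 = 9, \\
      2\cdot 75 + 90 &= 240, \\
      \text{parallel time} &= \lceil 240/9 \rceil + 1 = 27 + 1 = 28 \text{ time steps}.
    \end{align*}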

4.2.2 Lexicographic ordering on hyperplanes

The central issue in dynamic scheduling is finding an adaptive rule for instructing every processor what to do at run time, rather than explicitly specifying it at compile time. The adaptive rule is based on the ability to determine the next-to-be-executed point or the required-already-executed point for any loop instance. In order to efficiently define such a rule, a partial ordering over the index points is necessary. This section describes the lexicographic ordering on hyperplanes, by which the index space is traversed lexicographically along hyperplanes, yielding a zig-zag traversal in a 2D index space, or a spiral traversal in a 3D (or higher) index space. In order to achieve this, the concepts of successor and predecessor of a point are introduced. The successor provides: the means by which each processor can determine what iteration point to execute next, and a fast and reliable way for the currently running processor to determine the processors requiring the locally computed data. The converse holds for the predecessor.

All index points on a hyperplane can be lexicographically ordered. Therefore, given a family of hyperplanes of the general form

    Π_k : a_1 x_1 + · · · + a_n x_n = k,    (4.1)

where a_i, x_i, k in N, there exists a lexicographically minimum and a lexicographically maximum index point on every hyperplane that contains index points (2). The algorithms that produce the minimum and maximum points of hyperplane Π_k, denoted min_{k,n} and max_{k,n}, are given later in this section. Both min_{k,n} and max_{k,n} depend on the current hyperplane's value k and on the number of coordinates n.

Let i and j be two index points that belong to the same hyperplane Π_k. j is the successor of i, denoted j = Succ(i), if i < j (i is lexicographically smaller than j) and for no other index point j' of the same hyperplane does it hold that i < j' < j. In the special case where i is the maximum index point of Π_k, Succ(i) is the minimum index point of Π_{k+1}. In a similar fashion, i is the predecessor of j, denoted i = Pred(j), if j is the successor of i. Finally, we define:

    Succ^r(j) = j,                    if r = 0;
                Succ(j),              if r = 1;
                Succ(Succ^{r-1}(j)),  if r > 1.

Continuing the example of the previous section, Fig. 4.4 depicts the minimum and the maximum points for the hyperplanes 2x1 + x2 = 9 (Fig. 4.4(a)) and 2x1 + 5x2 = 21 (Fig. 4.4(b)).

Figure 4.4: Minimum and maximum points on hyperplanes.

Computing the minimum and maximum point of a hyperplane

Consider the hyperplane Π_k : a_1 x_1 + · · · + a_n x_n = k, where a_i, x_i, k in N. min_{k,n} and max_{k,n} can be defined recursively as follows:

    min_{k,1} = max_{k,1} = (k / a_1),  if a_1 divides k (i.e., k = a_1 ⌊k/a_1⌋);
                            undefined,  otherwise.

The recursive step, which computes min_{k,n+1} from the minimum on n coordinates, is given in Fig. 4.5. Now let j = (j_1,...,j_n) be an index point of hyperplane Π_k. Succ(j) is computed as shown in Fig. 4.6.

We advocate the use of successor index points along the optimal hyperplane as an adaptive dynamic rule. This rule is efficient in the sense that the overhead induced by its computation is negligible. Moreover, with regard to distributed memory platforms, it does not incur any additional communication cost.

(2) It is possible for the intersection of the index space with a particular hyperplane to be the empty set, in which case that hyperplane has no minimum and no maximum index point.

    for (i = ⌊k / a_{n+1}⌋; i >= 0; i--) {
        x_{n+1} = i;
        if min_{(k - a_{n+1} i), n} is defined then
            return (min_{(k - a_{n+1} i), n}, x_{n+1});
    }

Figure 4.5: Recursive function for computing min_{k,n+1} of a hyperplane.

    for (l = n-1; l > 0; l--) {
        p = a_{l+1} j_{l+1} + · · · + a_n j_n;
        for (i = 1; i < p / a_l + 1; i++) {
            q = p - a_l i;
            if min_{q, n-l} is defined then
                return (j_1, ..., j_{l-1}, j_l + i, min_{q, n-l});
        }
    }

Figure 4.6: Function for computing the successor on any hyperplane.

The index space is traversed hyperplane by hyperplane, and each hyperplane lexicographically, yielding a zig-zag/spiral traversal. This ensures both validity and optimality: the former because the points of any subsequent hyperplane depend (eventually) on the points of the current hyperplane (and perhaps on points of previous hyperplanes), and the latter because by following the optimal hyperplane all the inherent parallelism is exploited.
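As a concrete illustration of these two routines, the following self-contained sketch (our own code, not part of Cronus/1) specializes them to a 2D hyperplane a1·x1 + a2·x2 = k and, for simplicity, ignores the index space bounds, exactly as the figures do:

    /* Sketch: lexicographic minimum and successor on a 2D hyperplane
     * a1*x1 + a2*x2 = k (index space bounds are ignored).  Both functions
     * report the point through x1/x2; hyper_min returns 0 if no point exists. */
    #include <stdio.h>

    static int hyper_min(int a1, int a2, int k, int *x1, int *x2)
    {
        /* smallest x1 >= 0 whose remainder k - a1*x1 is a nonnegative
         * multiple of a2; equivalent to min_{k,2} of the recursive definition */
        for (int i = 0; a1 * i <= k; i++)
            if ((k - a1 * i) % a2 == 0) { *x1 = i; *x2 = (k - a1 * i) / a2; return 1; }
        return 0;
    }

    static int hyper_succ(int a1, int a2, int k, int *x1, int *x2)
    {
        /* next point on the same hyperplane: smallest x1 greater than the
         * current one that still leaves a feasible x2; if none exists, the
         * minimum point of the next non-empty hyperplane is returned */
        for (int i = *x1 + 1; a1 * i <= k; i++)
            if ((k - a1 * i) % a2 == 0) { *x1 = i; *x2 = (k - a1 * i) / a2; return k; }
        while (!hyper_min(a1, a2, ++k, x1, x2))
            ;                               /* skip empty hyperplanes */
        return k;
    }

    int main(void)
    {
        int x1, x2, k = 9;                  /* hyperplane 2*x1 + x2 = 9 of the example */
        hyper_min(2, 1, k, &x1, &x2);
        for (int step = 0; step < 8; step++) {
            printf("k = %d  point (%d,%d)\n", k, x1, x2);
            k = hyper_succ(2, 1, k, &x1, &x2);
        }
        return 0;
    }

Run on the example family 2x1 + x2 = k, starting from hyperplane 9, it enumerates (0,9), (1,7), (2,5), (3,3), (4,1) and then continues with the minimum point of hyperplane 10, which is exactly the zig-zag order used by SDS.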

4.2.3 Overview of Cronus/1

This section gives an overview of the tool and details the SDS method. Cronus/1 is an existing semi-automatic parallelization tool; a detailed description of the tool is contained in the Cronus User's Guide, available online. The organization of Cronus/1 is given in Fig. 4.7.

In the first stage (User Input), the user inputs a serial program. The next stage (Compile Time) consists of the loop nest detection phase (Parallelism Detection). If no loop nest can be found in the sequential program, the tool stops. Throughout the second phase (Parameters Extraction), the program is parsed and the following essential parameters are extracted: the depth of the loop nest (n), the size of the (iteration) index space (|J|) and the set of dependence vectors (DS). Once these crucial parameters are available, the tool calls the QuickHull algorithm, which returns the optimal hyperplane. At this point, the available number of processors (NP) is required as input. With the help of a code generator, the appropriate parallel code is generated (in the Semi-automatic Code Generation phase) for the given number of processors. This is achieved with the help of a Perl script that operates on a configuration file, which contains all the required information. Because we tackle complex loop nests, the implementation of a general parser for such loops is beyond the scope of our research. Hence, the user must manually define the loop body and the index space boundaries. In the configuration file the user must also define a startup function in C (automatically called by the generated parallel code) to perform data initialization on every processor, right before the actual parallel computation starts. The parallel code is written in C and contains run time routines for SDS and MPI primitives for data communication (if communication is necessary at all); it is eligible for compilation and execution on the parallel system at hand (in the Run Time stage).

Figure 4.7: The organization of Cronus/1 (User Input: serial code; Compile Time: parsing, parallelism detection, parameters extraction, semi-automatic code generation, with QuickHull and the number of available processors as inputs; Run Time: parallel code in C + MPI + SDS).

Successive dynamic scheduling description

SDS was developed based on the following assumptions: the parallel system is uniform/homogeneous (i.e., the processors are identical) and non-preemptive (a processor completes its current task before executing a new one). The most prominent features of SDS are:

- It is a dynamic scheduling strategy: both scheduling and task execution are performed at runtime, based on the availability of processors and on the release of iteration points from their dependencies, if they exist.
- It is a distributed scheduling: the scheduling task and/or the scheduling information are distributed among the processors and their memories.
- It is a self-scheduling technique: an idle processor determines its next iteration of the loop nest by incrementing the loop indices in a synchronized way.

The scheduling policy is the following: assuming there are NP available processors (P_1,...,P_NP), P_1 executes the initial index point L, P_2 executes Succ(L), P_3 executes Succ^2(L), and so on, until all processors are employed in execution for the first time. Upon completion of L, P_1 executes the next ready-to-be-executed (executable) point, found by skipping NP points in the index space (in the zig-zag/spiral manner described in Section 4.2.2). The coordinates of this point are obtained by applying the Succ function NP times to the point currently executed by P_1: Succ^NP(L). Similarly, upon completion of its current point (call it j), P_2 executes the point given by Succ^NP(j), and so on until all index points are exhausted. SDS ends when the terminal point U has been executed. Thus, SDS uses a perfect load distribution policy because it assigns to a processor only one loop iteration at a time. In other words, the way iterations are assigned to processors follows the round-robin scheduling method, achieving this perfect load balancing.
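The round-robin effect of the Succ^NP rule can be seen in the following toy sketch (our own illustration, with index points abstracted to their position in the zig-zag order, so that Succ becomes "+1" and Succ^NP becomes "+NP"):

    /* Sketch of the SDS assignment rule; execute_point() stands for the
     * loop body, and the NP processors are simulated one after another. */
    #include <stdio.h>

    static void execute_point(int rank, int pos)
    {
        printf("P%d executes point #%d\n", rank, pos);
    }

    int main(void)
    {
        int NP = 4;                /* available processors P_1..P_NP         */
        int J  = 10;               /* number of index points: L = 0, U = J-1 */

        for (int rank = 0; rank < NP; rank++)
            for (int pos = rank; pos < J; pos += NP)   /* Succ^NP skipping */
                execute_point(rank + 1, pos);
        return 0;
    }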

4.2.4 Implementation and experimental results

Cronus/1 was coded in C, except for the code generator, which was written in Perl. The parallel code produced by Cronus/1 uses point-to-point, synchronous send and receive MPI calls when required. The experiments were conducted on a cluster with 16 identical 500MHz Pentium III nodes. Each node has 256MB of RAM and a 10GB hard drive, and runs Linux. MPI (MPICH) was used to run the experiments over the FastEthernet interconnection network.

Case Study: FSBM ME algorithm

Block motion estimation in video coding standards such as MPEG-1, 2, 4 and H.261 is perhaps one of the most computation-intensive multimedia operations; hence, it is also among the most frequently implemented algorithms. The block matching algorithm is an essential element in video compression, used to remove the temporal redundancy among adjacent frames. The motion compensated frame is reconstructed from motion estimated blocks of pixels. Every pixel in each block is assumed to displace with the same 2D displacement, called motion vector, obtained with the Block ME algorithm. The Full-Search-Block-Matching Motion-Estimation algorithm (FSBM ME) [YH95] is a block matching method for which every pixel in the search area is tested in order to find the best matching block. Therefore, this algorithm offers the best match, at an extremely high computation cost. Assuming a current video frame is divided into N_h x N_v blocks in the horizontal and vertical directions, respectively, with each block containing N x N pixels, the most popular similarity criterion is the mean absolute distortion (MAD), defined as

    MAD(m,n) = (1/N^2) Σ_{i=0}^{N-1} Σ_{j=0}^{N-1} |x(i,j) - y(i+m, j+n)|    (4.2)

where x(i,j) and y(i+m, j+n) are the pixels of the current and previous frames. The motion vector (MV) corresponding to the minimum MAD within the search area is given by

    MV = arg{min MAD(m,n)},  -p <= m,n <= p,    (4.3)

where p is the search range parameter. The algorithm focuses on the situation where the search area is a region in the reference frame consisting of (2p + 1)^2 pixels. In FSBM, the MAD differences between the current block and all (2p + 1)^2 candidate blocks are computed. The displacement that yields the minimum MAD among these (2p + 1)^2 positions is chosen as the motion vector corresponding to the present block. For the entire video frame, this highly regular FSBM can be described as a six-level nested loop algorithm, as shown in Fig. 4.8. As can be seen from the figure, the general loop nest is designated by the outer two loops, whereas the inner four loops represent the loop body. Unfortunately, this algorithm does not have any loop-carried dependencies involving the two outer loop indices, i.e., the iterations of the FSBM loop body are completely independent of each other. This makes the FSBM ME algorithm a particular case study for Cronus/1, due to the absence of loop-carried dependencies. However, Cronus/1 performs very well for the FSBM as it is (as can be seen later in this section), thus preserving the advantage of perfect load balancing provided by SDS.
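Since Figure 4.8 itself is not reproduced here, the sketch below shows the standard six-level FSBM loop nest on a toy frame (illustrative code with our own variable names and toy sizes, not the thesis implementation):

    /* Sketch of the classical six-level FSBM loop nest: the two outer loops
     * range over the blocks of the frame, the next two over the candidate
     * displacements in [-P, P], and the two innermost accumulate the MAD
     * of equation (4.2). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <limits.h>

    #define FW 32          /* toy frame width            */
    #define FH 32          /* toy frame height           */
    #define NB 8           /* block size N x N           */
    #define P  2           /* search range parameter p   */

    static unsigned char cur[FH][FW], ref[FH][FW];

    int main(void)
    {
        for (int i = 0; i < FH; i++)               /* toy frame contents */
            for (int j = 0; j < FW; j++) {
                ref[i][j] = (unsigned char)((i * 7 + j * 13) & 0xff);
                cur[i][j] = ref[i][(j + 1) % FW];  /* shifted by one pixel */
            }

        for (int bi = 0; bi + NB <= FH; bi += NB)        /* block row    */
            for (int bj = 0; bj + NB <= FW; bj += NB) {  /* block column */
                long best = LONG_MAX;
                int best_m = 0, best_n = 0;
                for (int m = -P; m <= P; m++)            /* displacement */
                    for (int n = -P; n <= P; n++) {
                        if (bi + m < 0 || bi + m + NB > FH ||
                            bj + n < 0 || bj + n + NB > FW) continue;
                        long mad = 0;
                        for (int i = 0; i < NB; i++)     /* MAD of (4.2) */
                            for (int j = 0; j < NB; j++)
                                mad += labs((long)cur[bi + i][bj + j]
                                            - ref[bi + m + i][bj + n + j]);
                        if (mad < best) { best = mad; best_m = m; best_n = n; }
                    }
                printf("block (%d,%d): MV = (%d,%d)\n", bi / NB, bj / NB, best_m, best_n);
            }
        return 0;
    }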

Figure 4.8: Six-level nested loop full search block matching motion estimation.

I. Testing Cronus/1 with FSBM

The performance of Cronus/1 for our case study, the FSBM algorithm, is presented here. The parallel execution time is the sum of communication time, busy time (SDS overhead + loop body computation) and idle time (time spent by a processor waiting for data to become available, or for points to become eligible for execution). The obtained speedup (defined as the sequential execution time over the parallel execution time) is reported in comparison with the ideal speedup for different numbers of processors. The efficiency (measured in percent) is defined as the speedup over the number of processors used to achieve the respective speedup, i.e., efficiency = Speedup / (# of processors), and the optimal efficiency is considered 100%.

We experimented on the FSBM algorithm with different frame sizes and search ranges and produced the serial and parallel code. Our results prove to be very close to the ideal speedup. The results in Fig. 4.9 were produced for a frame of 1024 x 768 pixels, a search range of 10 blocks, and a block size of 8 x 8 pixels. The results in Fig. 4.10 are for a second frame size, a search range of 15 blocks and the same block size as before. Finally, Fig. 4.11 shows the performance of our tool for a third frame size, a search range of 15 blocks, and the same block size. For all FSBM tests, Cronus/1 used the available 16-node cluster. This case study does not exhibit loop-carried dependencies (with respect to the two outer loops). However, we chose to present the FSBM algorithm because of its high practical importance in video coding and its extremely high computational cost (due to its complex loop body).

Figure 4.9: Speedup for FSBM, with frame size 1024 x 768.

Figure 4.10: Speedup for FSBM, with frame size 1280 x ….

II. Testing Cronus/1 with random examples

In this section we present the performance of Cronus/1 for a suite of randomly generated GLs with uniform loop-carried dependencies. The parallel execution time is taken as above: the sum of communication time, busy time (SDS overhead + loop body computation) and idle time (the time spent by a processor waiting for data to become available, or for points to become eligible for execution).

Figure 4.11: Speedup for FSBM, with frame size 600 x ….

The obtained speedup is defined as above, and is reported in comparison with the ideal speedup for different numbers of processors. The efficiency (measured in percent) is defined as the speedup over the number of processors used to achieve the respective speedup, i.e., efficiency = Speedup / (# of processors), and the optimal efficiency is considered 100%. The prerequisite for parallelization to obtain the most favorable speedups is that the problem size must be sufficiently large. The two sets of randomly generated GLs have index space sizes of 600 x 600 and 800 x 800, respectively, and 2-6 uniform dependence vectors with sizes of 1-8. The results for these two sets are given in Fig. 4.12 and Fig. 4.13. The experimental results validate the presented theory and corroborate the efficiency of the generated parallel code.

Figure 4.12: Speedup on random examples with 600 x 600 index space size.

Figure 4.13: Speedup on random examples with 800 x 800 index space size.

Results show that simplicity and efficiency are the key factors for minimizing the runtime of the parallel program. In our approach the compilation time is kept to a minimum because the index space is not traversed (not even once) and the decision of what iteration to compute next is taken at runtime. We decided to make this trade-off because the successor concept is very simple and efficient, i.e., it does not incur a significant penalty, especially considering heavy loop bodies. An efficient parallelizing tool should strive to minimize the sum of compilation and run time compared to the cost of the original sequential algorithm. SDS can easily be extended for a more coarse-grain approach to general loops, and we are currently working on that.

4.3 Dynamic multi-phase scheduling for heterogeneous clusters

This section addresses the problem of scheduling loops with dependencies on heterogeneous (dedicated and non-dedicated) clusters. We extended three well known dynamic schemes (Chunk self-scheduling CSS, Trapezoid self-scheduling TSS and Distributed TSS) by introducing synchronization points at certain intervals so that processors compute in wavefront fashion. Our scheme, called dynamic multi-phase scheduling (DMPS), was published in [CAR+05]. DMPS is a distributed coarse-grain approach that is applied to loops with uniform iteration dependencies and general loop bodies.

Most clusters nowadays are heterogeneous and non-dedicated to specific users, yielding a system with variable workload. When static schemes are applied to heterogeneous systems with variable workload, performance deteriorates severely. Dynamic scheduling algorithms adapt the assigned number of iterations to match the workload variation of both homogeneous and heterogeneous systems. An important class of dynamic scheduling algorithms are the self-scheduling schemes (such as CSS [KW85], GSS [PK], TSS [TN93], Factoring [BL00], and others [MJ94]). On distributed systems these schemes are implemented using a master-slave model.

Since distributed systems are characterized by heterogeneity, loop scheduling schemes must take into account the processing power of each computer in the system in order to offer load balancing. The processing power depends on CPU speed, memory, cache structure and even the program type. Furthermore, the processing power depends on the workload of the computer throughout the execution of the problem. Therefore, load balancing methods adapted to distributed environments take into account the relative powers of the computers. These relative computing powers are used as weights that scale the size of the sub-problem assigned to each processor. This significantly improves the total execution time when a non-dedicated heterogeneous computing environment is used. Such algorithms were presented in [HBFF00], [HSUW96]. A recent algorithm that improves TSS by taking into account the processing powers of a non-dedicated heterogeneous system is DTSS (Distributed TSS) [CABG].

All dynamic schemes proposed so far apply only to parallel loops without dependencies. When loops without dependencies are parallelized with dynamic schemes, the index space is partitioned into chunks, and the master assigns these chunks to processors upon request. Throughout the parallel execution, every slave works independently and, upon chunk completion, sends the results back to the master. Obviously, this approach is not suitable for dependence loops because, due to dependencies, iterations in one chunk depend on iterations in other chunks. Hence, slaves need to communicate. Inter-processor communication is the most important reason for performance deterioration when parallelizing loops with dependencies. No study of dynamic algorithms for loops with dependencies on homogeneous or heterogeneous clusters has been reported so far. In order to parallelize nested loops with dependencies, after partitioning the index space into chunks (using one of the three schemes), we introduced synchronization points at certain intervals so that processors compute chunks in wavefront fashion. Synchronization points are carefully placed so that the volume of data exchange is reduced and the wavefront parallelism is improved.

4.3.1 Motivation for dynamic scheduling of loops with dependencies

Existing dynamic scheduling algorithms cannot cope with uniform dependence loops. Consider, for instance, the heat equation, with its pseudocode below:

    /* Heat equation */
    for (l=1; l<loop; l++) {
      for (i=1; i<width; i++) {
        for (j=1; j<height; j++) {
          A[i][j] = 1/4*(A[i-1][j] + A[i][j-1] + A'[i+1][j] + A'[i][j+1]);
        }
      }
    }

When dynamic schemes are applied to parallelize this problem, the index space is partitioned into chunks that are assigned to slaves. These slaves then work independently. But due to the presence of dependencies, the slaves have to communicate. However, existing dynamic schemes do not provide for inter-slave communication, only for master-to-slaves communication. Therefore, in order to apply dynamic schemes to dependence loops, one must provide an inter-slave communication scheme, such that the problem's dependencies are not violated or ignored.

In this work we bring dynamic scheduling schemes into the field of scheduling loops with dependencies. We propose an inter-slave communication scheme for three well known dynamic methods: CSS [KW85], TSS [TN93] and DTSS [CABG]. In all cases, after the master assigns chunks to slaves, the slaves synchronize by means of synchronization points. This provides the slaves with a unified communication scheme. This is depicted in Figs. 4.14 and 4.15, where chunks i-1, i, i+1 are assigned to slaves P_{k-1}, P_k, P_{k+1}, respectively. The shaded areas denote sets of iterations that are computed concurrently by different PEs. When P_k reaches the synchronization point SP_{j+1} (i.e., after computing SC_{i,j+1}), it sends P_{k+1} only the data P_{k+1} requires to begin the execution of SC_{i+1,j+1}. The data sent to P_{k+1} consists only of those iterations of SC_{i,j+1}, imposed by the dependence vectors, on which the iterations of SC_{i+1,j+1} depend. Similarly, P_k receives from P_{k-1} the data P_k requires to proceed with the execution of SC_{i,j+2}. Note that slaves do not reach a synchronization point at the same time. For instance, P_k reaches SP_{j+1} earlier than P_{k+1} and later than P_{k-1}. The existence of synchronization points leads to pipelined execution, as shown in Fig. 4.14 by the shaded areas.

Figure 4.14: Synchronization points.
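A minimal MPI sketch of this pipelined, synchronization-point execution is given below (our own illustration with toy sizes, not the generated DMPS code; each rank owns one chunk and exchanges one boundary row per synchronization interval):

    /* Sketch: each slave owns a chunk of VI rows and, every SH columns,
     * receives the halo row needed from the previous slave, computes the
     * sub-chunk SC_{i,j}, and sends its last row to the next slave. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define VI     4      /* rows per chunk (chunk size along u2)  */
    #define U1_EXT 12     /* extent of the index space along u1    */
    #define SH     3      /* synchronization interval H            */

    int main(int argc, char **argv)
    {
        int rank, size;
        double chunk[VI + 1][U1_EXT];          /* row 0 is the halo row */
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        memset(chunk, 0, sizeof chunk);

        int prev = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int next = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        for (int j = 0; j < U1_EXT; j += SH) { /* one synchronization interval */
            /* halo needed for this interval, from the previous slave */
            MPI_Recv(&chunk[0][j], SH, MPI_DOUBLE, prev, j, MPI_COMM_WORLD, &st);

            for (int c = j; c < j + SH; c++)   /* compute SC_{i,j}: toy stencil */
                for (int r = 1; r <= VI; r++)
                    chunk[r][c] = chunk[r - 1][c]
                                + (c > 0 ? chunk[r][c - 1] : 0.0);

            /* last computed row, needed by the next slave's interval */
            MPI_Send(&chunk[VI][j], SH, MPI_DOUBLE, next, j, MPI_COMM_WORLD);
        }
        printf("slave %d finished its chunk\n", rank);
        MPI_Finalize();
        return 0;
    }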

4.3.2 Dynamic scheduling for loops with dependencies

This section gives a brief description of the three dynamic algorithms we used. Chunk Self-Scheduling (CSS) assigns a chunk consisting of a fixed number of iterations (known as C_i) to a slave. A large chunk size reduces the scheduling overhead, but also increases the chance of load imbalance. The Trapezoid Self-Scheduling (TSS) [TN93] scheme linearly decreases the chunk size C_i.

Figure 4.15: Chunks are formed along u2 and synchronization points (SP) are introduced along u1.

Considering |J| the total number of iterations of the loop, in TSS the first and last (assigned) chunk size pair (F, L) may be set by the programmer. In a conservative selection, the (F, L) pair is determined as F = |J|/(2m) and L = 1. This ensures that the load of the first chunk is less than 1/m of the total load in most loop distributions and reduces the chance of imbalance due to a large first chunk. The proposed number of steps needed for the scheduling process is then N = 2|J|/(F + L), and the decrease between consecutive chunks is D = (F - L)/(N - 1). The chunk sizes in TSS are therefore C_1 = F, C_2 = F - D, C_3 = F - 2D, ...

Distributed TSS (DTSS) [CABG] improves on TSS by selecting the chunk sizes according to the computational power of the slaves. DTSS uses a model that includes the number of processes in the run-queue of each PE. Every process running on a PE is assumed to take an equal share of its computing resources. The programmer may determine the pair (F, L) according to TSS, or the following formulas may be used in the conservative selection approach (by default): F = |J|/(2A) and L = 1, where A is the total available computing power of the slaves. The total number of steps is N = 2|J|/(F + L) and the chunk decrement is D = (F - L)/(N - 1). The size of a chunk in this case is C_i = A_k (F - D (S_{k-1} + (A_k - 1)/2)), where S_{k-1} = A_1 + · · · + A_{k-1}. When all PEs are dedicated to a single process, then A_k = V_k. Also, when all the PEs have the same speed, then V_k = 1 and the tasks assigned in DTSS are the same as in TSS. The important difference between DTSS and TSS is that in DTSS the next chunk is allocated according to a PE's available computing power, whereas in TSS all PEs are simply treated in the same way. Thus, faster PEs get more iterations than slower ones in DTSS.

Table 4.2 shows the chunk sizes computed with CSS, TSS and DTSS for an index space of |J| = … points and m = 10 slaves. CSS and TSS obtain the same chunk sizes in dedicated clusters as in non-dedicated clusters; DTSS adapts the chunk size to match the different computational powers of the slaves. These algorithms have been evaluated for parallel loops and it has been established that the DTSS algorithm improves on TSS, which in turn outperforms CSS [CABG].
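For concreteness, the following sketch generates a TSS chunk sequence directly from these formulas (the values of |J| and m are arbitrary example inputs, and sizes are left as reals for readability; an actual implementation would round them to integers):

    /* Sketch: TSS chunk sizes with the conservative selection
     * F = |J|/(2m), L = 1; purely illustrative. */
    #include <stdio.h>

    int main(void)
    {
        double J = 5000.0, m = 10.0;       /* example index space size and slaves */
        double F = J / (2.0 * m), L = 1.0;
        double N = 2.0 * J / (F + L);      /* proposed number of steps            */
        double D = (F - L) / (N - 1.0);    /* decrement between consecutive chunks */

        double C = F, assigned = 0.0;
        for (int i = 1; assigned < J; i++) {
            if (C < L) C = L;              /* never go below the last chunk size  */
            if (assigned + C > J) C = J - assigned;
            printf("C_%d = %.0f\n", i, C);
            assigned += C;
            C -= D;
        }
        return 0;
    }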

Table 4.2: Sample chunk sizes given for |J| = … and m = 10.

    Algorithm                Chunk sizes
    CSS                      …
    TSS                      …
    DTSS (dedicated)         …
    DTSS (non-dedicated)     …

For the sake of simplicity, we consider a 2D dependence loop with terminal point U = (u1, u2). The index space of this loop is divided into chunks along u2 (using one of the three algorithms). Along u1, synchronization points are introduced at equal intervals. The interval length H, chosen by the programmer, determines the number of synchronization points M.

4.3.3 The DMPS(x) algorithm

The following notation is essential for the inter-slave communication scheme: the master always names the slave assigned the latest chunk (C_i) as current and the slave assigned chunk C_{i-1} as previous. Whenever a new chunk is computed and assigned, the current slave becomes the (new) previous slave, whereas the new slave is named (new) current. Fig. 4.16 below shows the state diagram relating the (new) current and (new) previous slaves. The state transitions are triggered by new chunk requests to the master.

Figure 4.16: State diagram of the slaves.

The DMPS(x) algorithm is described by the following pseudocode:

INPUT:
(a) An n-dimensional dependence nested loop, with terminal point U.
(b) The choice of algorithm: CSS, TSS or DTSS.
(c) If CSS is chosen, the chunk size C_i.
(d) The synchronization interval H.
(e) The number of slaves m; in the case of DTSS, the virtual power of every slave.

Master:
Initialization:
(a) Register slaves. In the case of DTSS, slaves report their ACP.
(b) Calculate F, L, N, D for TSS and DTSS. For CSS, use the given C_i.
1. While there are unassigned iterations do:
   (a) If a request arrives, put it in the queue.
   (b) Pick a request from the queue and compute the next chunk using CSS, TSS or DTSS.
   (c) Update the current and previous slave ids.
   (d) Send the id of the current slave to the previous one.

Slave P_k:
Initialization:
(a) Register with the master; in the case of DTSS, report ACP_k.
(b) Compute M according to the given H.
1. Send a request to the master.
2. Wait for the reply; if a chunk is received from the master go to step 3, else go to OUTPUT.
3. While the next synchronization point is not reached, compute chunk i.
4. If the id of the send-to slave is known, go to step 5, else go to step 6.
5. Send the computed data to the send-to slave.
6. Receive data from the receive-from slave and go to step 3.

OUTPUT:
Master: If there are no more chunks to be assigned to slaves, terminate.
Slave P_k: If no more tasks come from the master, terminate.

Remarks: (1) Note that the synchronization intervals are the same for all chunks. For remarks (2)-(5) below, refer to Figs. 4.14 and 4.15 for an illustration. (2) Upon completion of SC_{i,0}, slave P_k requests from the master the identity of the send-to slave. If no reply is received, then P_k is still the current slave, and it proceeds to receive data from the previous slave P_{k-1}, and then it begins SC_{i,1}. (3) Slave P_k keeps requesting the identity of the send-to slave at the end of every SC_{i,j}, until either a (new) current slave has been appointed by the master or P_k has finished chunk i. (4) If slave P_k has already executed SC_{i,0},...,SC_{i,j} by the time it is informed by the master about the identity of the send-to slave, it sends all the computed data from SC_{i,0},...,SC_{i,j}. (5) If no send-to slave has been appointed by the time slave P_k finishes chunk i, then all the computed data is kept in the local memory of slave P_k. Then P_k makes a new request to the master, to become the (new) current slave.
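A stripped-down MPI skeleton of the master-slave interaction described above is sketched below (our own illustration: it keeps only the request/assign loop and omits the current/previous bookkeeping and the inter-slave data exchange):

    /* Sketch: idle slaves request work and the master replies with the next
     * chunk, computed here with a simple decreasing (TSS-like) size. */
    #include <mpi.h>
    #include <stdio.h>

    #define TAG_REQ   1
    #define TAG_CHUNK 2

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Status st;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                              /* master */
            int J = 1000, start = 0, C = 100;
            int active = size - 1;
            while (active > 0) {
                int who;
                MPI_Recv(&who, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                         MPI_COMM_WORLD, &st);
                int len = (J - start < C) ? (J - start) : C;
                int chunk[2] = { start, len };
                MPI_Send(chunk, 2, MPI_INT, who, TAG_CHUNK, MPI_COMM_WORLD);
                if (len == 0) active--;               /* that slave terminates  */
                start += len;
                if (C > 10) C -= 10;                  /* decreasing chunk sizes */
            }
        } else {                                      /* slave */
            for (;;) {
                int chunk[2];
                MPI_Send(&rank, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
                MPI_Recv(chunk, 2, MPI_INT, 0, TAG_CHUNK, MPI_COMM_WORLD, &st);
                if (chunk[1] == 0) break;             /* no more chunks */
                printf("slave %d: iterations [%d, %d)\n",
                       rank, chunk[0], chunk[0] + chunk[1]);
                /* ... compute the chunk, synchronization interval by interval,
                   exchanging boundary data with the neighbouring slaves ... */
            }
        }
        MPI_Finalize();
        return 0;
    }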

4.3.4 Implementation and test results

Our implementation relies on the distributed programming framework offered by the mpich implementation of the Message Passing Interface (MPI) [Pac97], and on the gcc compiler. We used a heterogeneous distributed system that consists of 10 computers, one of them being the master. More precisely, we used: (a) 4 Intel Pentium III 1266MHz machines with 1GB RAM (called zealots), assumed to have VP_k = 1.5 (one of these was chosen to be the master); and (b) 6 Intel Pentium III 500MHz machines with 512MB RAM (called kids), assumed to have VP_k = 0.5. The virtual power for each machine type was determined as a ratio of processing times, established by timing a test program on each machine type. The machines are interconnected by a Fast Ethernet, with a bandwidth of 100 Mbits/sec.

We present two cases, dedicated and non-dedicated. In the first case, processors are dedicated to running the program and no other loads are interposed during the execution. We take measurements with up to 9 slaves. We use one of the fast machines as a master. In the second case, at the beginning of the execution of the program, we start a resource expensive process on some of the slaves. Due to the fact that scheduling algorithms for loops with uniform dependencies are usually static and no other dynamic algorithms have been reported so far, we cannot compare with similar algorithms. We ran three series of experiments for the dedicated and non-dedicated case: (1) DMPS(CSS), (2) DMPS(TSS), and (3) DMPS(DTSS), and compare the results for two real-life case studies. We ran the above series for m = 3,4,5,6,7,8,9 slaves in order to compute the speedup. We compute the speedup according to the following equation:

    S_p = min{T_P1, T_P2, ..., T_Pm} / T_PAR    (4.4)

where T_Pi is the serial execution time on slave P_i, 1 <= i <= m, and T_PAR is the parallel execution time (on m slaves). Note that in the plotting of S_p, we use VP instead of m on the x-axis.

Test problems

We used the heat equation computation for a domain of … points, and the Floyd-Steinberg error diffusion computation for an image of … pixels, on a system consisting of 9 heterogeneous slave machines and one master, with the following configuration: zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4, kid3, kid4, kid5, kid6. For instance, when using 6 slaves, the machines used are: zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4, kid3. The slaves in italics are the ones loaded in the non-dedicated case. As mentioned previously, by starting a resource expensive process on these slaves, their ACP is halved.

I. Heat Equation

The heat equation computation is one of the most widely used case studies in the literature, and its loop body is similar to the majority of the numerical methods used for solving partial differential equations. It computes the temperature in each pixel of its domain based on two values of the current time step (A[i-1][j], A[i][j-1]) and two values from the previous time step (A'[i+1][j], A'[i][j+1]), over a number of loop time steps. The dependence vectors are d1 = (1,0) and d2 = (0,1).

The pseudocode is given below:

    /* Heat equation */
    for (l=1; l<loop; l++) {
      for (i=1; i<width; i++) {
        for (j=1; j<height; j++) {
          A[i][j] = 1/4*(A[i-1][j] + A[i][j-1] + A'[i+1][j] + A'[i][j+1]);
        }
      }
    }

An illustration of the dependence patterns is given in Fig. 4.17. The iterations in a chunk are executed in the order imposed by the dependencies of the heat equation. Whenever a synchronization point is reached, data is exchanged between the processors executing neighboring chunks. Table 4.3 shows the comparative results we obtained for the heat equation, for the three series of experiments, DMPS(CSS), DMPS(TSS) and DMPS(DTSS), on a dedicated and a non-dedicated heterogeneous cluster. The values represent the parallel times (in seconds) for different numbers of slaves. Three synchronization intervals were chosen, and the total ACP varied according to the number of slaves. Fig. 4.18 presents the speedups for the heat equation on an index space of … points, for one time step (i.e., loop = 1), for chunk sizes computed with CSS, TSS and DTSS and synchronization interval 150, on a dedicated cluster and a non-dedicated cluster.

II. Floyd-Steinberg

The Floyd-Steinberg computation [FS76] is an image processing algorithm used for the error-diffusion dithering of a width by height grayscale image. The boundary conditions are ignored. The dependencies are d1 = (1,0), d2 = (1,1), d3 = (0,1) and d4 = (1,-1). The pseudocode is given below:

    /* Floyd-Steinberg */
    for (i=1; i<width; i++) {
      for (j=1; j<height; j++) {
        I[i][j] = trunc(J[i][j]) + 0.5;
        err = J[i][j] - I[i][j]*255;
        J[i-1][j]   += err*(7/16);
        J[i-1][j-1] += err*(3/16);
        J[i][j-1]   += err*(5/16);
        J[i-1][j+1] += err*(1/16);
      }
    }

An illustration of the dependence patterns is given in Fig. 4.17. The iterations in a chunk are executed in the order imposed by the dependencies of the Floyd-Steinberg algorithm. Whenever a synchronization point is reached, data is exchanged between the processors executing neighboring chunks.

Figure 4.17: The dependence patterns for the heat equation and Floyd-Steinberg.

Comparative results for the Floyd-Steinberg case study on a dedicated and a non-dedicated heterogeneous cluster are given in Table 4.4. The values represent the parallel times (in seconds) for different numbers of slaves.

Table 4.3: Parallel execution times (sec) for the heat equation on a dedicated & non-dedicated heterogeneous cluster.

Three synchronization intervals were chosen, and the total ACP varied according to the number of slaves. Fig. 4.19 presents the speedup results of the Floyd-Steinberg algorithm, for the three variations. The size of the index space was … . Chunk sizes were computed with CSS, TSS and DTSS, and the synchronization interval was chosen to be 100, on a dedicated cluster and a non-dedicated cluster.

4.3.5 Interpretation of the results

As expected, the results for the dedicated cluster are much better for both case studies. In particular, DMPS(TSS) seems to perform slightly better than DMPS(CSS). This was expected, since TSS provides better load balancing than CSS for simple parallel loops without dependencies. In addition, DMPS(DTSS) outperforms both algorithms, because it explicitly accounts for the heterogeneity of the slaves. For the non-dedicated case, one can see that DMPS(CSS) and DMPS(TSS) cannot handle workload variations as effectively as DMPS(DTSS). This is shown in Figs. 4.18 and 4.19. The speedup for DMPS(CSS) and DMPS(TSS) decreases as loaded slaves are added, whereas for DMPS(DTSS) it increases even when slaves are loaded.

Figure 4.18: Speedups for the heat equation on a dedicated & non-dedicated heterogeneous cluster (DMPS(CSS), DMPS(TSS) and DMPS(DTSS); speedup versus virtual powers).

In the non-dedicated approach, our choice was to load the slow processors, so as to incur large differences between the processing powers of the two machine types. Even in this case, DMPS(DTSS) achieved good results.

The ratio of computation to communication, along with the selection of the synchronization interval, plays a key role in the overall performance of our scheme. A rule of thumb is to maintain this ratio >= 1 at all times. The choice of the (fixed) synchronization interval has a crucial impact on performance, and it depends on the concrete problem. H must be chosen so as to ensure that the computation-to-communication ratio is maintained above 1, even as V_i decreases at every scheduling step. Assuming that for a certain H this condition is satisfied, small changes in the value of H do not significantly alter the overall performance. Notice that the best synchronization interval for the heat equation was H = 150, whereas for Floyd-Steinberg better results were obtained for H = 100. The performance differences for interval sizes close to the ones depicted in Figs. 4.18 and 4.19 are small.
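One way to make this rule of thumb explicit is the following back-of-the-envelope model (our own simplifying assumptions, not taken from the text): suppose each synchronization interval of a chunk contains V_i · H iterations costing t_comp each, and each synchronization point exchanges on the order of H boundary points costing t_comm each, plus a fixed message latency c. Then

    \[
      \frac{\text{computation}}{\text{communication}}
      \;\approx\; \frac{V_i \, H \, t_{comp}}{c + H \, t_{comm}} \;\ge\; 1
      \quad\Longrightarrow\quad
      H \;\ge\; \frac{c}{V_i\, t_{comp} - t_{comm}},
      \qquad \text{provided } V_i\, t_{comp} > t_{comm}.
    \]

Under this reading, the lower bound on H shrinks as the chunk height V_i or the per-iteration work grows, which may help explain why the best interval differs between the two case studies.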

Table 4.4: Parallel execution times (sec) for Floyd-Steinberg on a dedicated & non-dedicated heterogeneous cluster.


More information

From Static to Dynamic Routing: Efficient Transformations of Store-and-Forward Protocols

From Static to Dynamic Routing: Efficient Transformations of Store-and-Forward Protocols From Static to Dynamic Routing: Efficient Transformations of Store-and-Forward Protocols Christian Scheideler Ý Berthold Vöcking Þ Abstract We investigate how static store-and-forward routing algorithms

More information

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications New Optimal Load Allocation for Scheduling Divisible Data Grid Applications M. Othman, M. Abdullah, H. Ibrahim, and S. Subramaniam Department of Communication Technology and Network, University Putra Malaysia,

More information

Parallel Algorithm Design. CS595, Fall 2010

Parallel Algorithm Design. CS595, Fall 2010 Parallel Algorithm Design CS595, Fall 2010 1 Programming Models The programming model o determines the basic concepts of the parallel implementation and o abstracts from the hardware as well as from the

More information

COMPUTATIONAL PROPERIES OF DSP ALGORITHMS

COMPUTATIONAL PROPERIES OF DSP ALGORITHMS COMPUTATIONAL PROPERIES OF DSP ALGORITHMS 1 DSP Algorithms A DSP algorithm is a computational rule, f, that maps an ordered input sequence, x(nt), to an ordered output sequence, y(nt), according to xnt

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing Bootcamp for SahasraT 7th September 2018 Aditya Krishna Swamy adityaks@iisc.ac.in SERC, IISc Acknowledgments Akhila, SERC S. Ethier, PPPL P. Messina, ECP LLNL HPC tutorials

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

Parallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads)

Parallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads) Parallel Programming Models Parallel Programming Models Shared Memory (without threads) Threads Distributed Memory / Message Passing Data Parallel Hybrid Single Program Multiple Data (SPMD) Multiple Program

More information

Massively Parallel Computation for Three-Dimensional Monte Carlo Semiconductor Device Simulation

Massively Parallel Computation for Three-Dimensional Monte Carlo Semiconductor Device Simulation L SIMULATION OF SEMICONDUCTOR DEVICES AND PROCESSES Vol. 4 Edited by W. Fichtner, D. Aemmer - Zurich (Switzerland) September 12-14,1991 - Hartung-Gorre Massively Parallel Computation for Three-Dimensional

More information

Lecture 27 Programming parallel hardware" Suggested reading:" (see next slide)"

Lecture 27 Programming parallel hardware Suggested reading: (see next slide) Lecture 27 Programming parallel hardware" Suggested reading:" (see next slide)" 1" Suggested Readings" Readings" H&P: Chapter 7 especially 7.1-7.8" Introduction to Parallel Computing" https://computing.llnl.gov/tutorials/parallel_comp/"

More information

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction

More information

Lecture 9: Load Balancing & Resource Allocation

Lecture 9: Load Balancing & Resource Allocation Lecture 9: Load Balancing & Resource Allocation Introduction Moler s law, Sullivan s theorem give upper bounds on the speed-up that can be achieved using multiple processors. But to get these need to efficiently

More information

MULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration

MULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration MULTIPROCESSORS Characteristics of Multiprocessors Interconnection Structures Interprocessor Arbitration Interprocessor Communication and Synchronization Cache Coherence 2 Characteristics of Multiprocessors

More information

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution Multigrid Pattern I. Problem Problem domain is decomposed into a set of geometric grids, where each element participates in a local computation followed by data exchanges with adjacent neighbors. The grids

More information

Cost Models for Query Processing Strategies in the Active Data Repository

Cost Models for Query Processing Strategies in the Active Data Repository Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272

More information

Load Balancing in the Macro Pipeline Multiprocessor System using Processing Elements Stealing Technique. Olakanmi O. Oladayo

Load Balancing in the Macro Pipeline Multiprocessor System using Processing Elements Stealing Technique. Olakanmi O. Oladayo Load Balancing in the Macro Pipeline Multiprocessor System using Processing Elements Stealing Technique Olakanmi O. Oladayo Electrical & Electronic Engineering University of Ibadan, Ibadan Nigeria. Olarad4u@yahoo.com

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes: BIT 325 PARALLEL PROCESSING ASSESSMENT CA 40% TESTS 30% PRESENTATIONS 10% EXAM 60% CLASS TIME TABLE SYLLUBUS & RECOMMENDED BOOKS Parallel processing Overview Clarification of parallel machines Some General

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃLPHÃIRUÃDÃSDFHLPH $GDSWLYHÃURFHVVLQJÃ$OJRULWKPÃRQÃDÃDUDOOHOÃ(PEHGGHG \VWHP Jack M. West and John K. Antonio Department of Computer Science, P.O. Box, Texas Tech University,

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Computing and Informatics, Vol. 36, 2017, , doi: /cai

Computing and Informatics, Vol. 36, 2017, , doi: /cai Computing and Informatics, Vol. 36, 2017, 566 596, doi: 10.4149/cai 2017 3 566 NESTED-LOOPS TILING FOR PARALLELIZATION AND LOCALITY OPTIMIZATION Saeed Parsa, Mohammad Hamzei Department of Computer Engineering

More information

A Proposal for the Implementation of a Parallel Watershed Algorithm

A Proposal for the Implementation of a Parallel Watershed Algorithm A Proposal for the Implementation of a Parallel Watershed Algorithm A. Meijster and J.B.T.M. Roerdink University of Groningen, Institute for Mathematics and Computing Science P.O. Box 800, 9700 AV Groningen,

More information

Constraint Logic Programming (CLP): a short tutorial

Constraint Logic Programming (CLP): a short tutorial Constraint Logic Programming (CLP): a short tutorial What is CLP? the use of a rich and powerful language to model optimization problems modelling based on variables, domains and constraints DCC/FCUP Inês

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

MATE-EC2: A Middleware for Processing Data with Amazon Web Services MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering

More information

Scheduling in Multiprocessor System Using Genetic Algorithms

Scheduling in Multiprocessor System Using Genetic Algorithms Scheduling in Multiprocessor System Using Genetic Algorithms Keshav Dahal 1, Alamgir Hossain 1, Benzy Varghese 1, Ajith Abraham 2, Fatos Xhafa 3, Atanasi Daradoumis 4 1 University of Bradford, UK, {k.p.dahal;

More information

Parallel Programs. EECC756 - Shaaban. Parallel Random-Access Machine (PRAM) Example: Asynchronous Matrix Vector Product on a Ring

Parallel Programs. EECC756 - Shaaban. Parallel Random-Access Machine (PRAM) Example: Asynchronous Matrix Vector Product on a Ring Parallel Programs Conditions of Parallelism: Data Dependence Control Dependence Resource Dependence Bernstein s Conditions Asymptotic Notations for Algorithm Analysis Parallel Random-Access Machine (PRAM)

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

A Framework for Space and Time Efficient Scheduling of Parallelism

A Framework for Space and Time Efficient Scheduling of Parallelism A Framework for Space and Time Efficient Scheduling of Parallelism Girija J. Narlikar Guy E. Blelloch December 996 CMU-CS-96-97 School of Computer Science Carnegie Mellon University Pittsburgh, PA 523

More information

A Comparative Study of Load Balancing Algorithms: A Review Paper

A Comparative Study of Load Balancing Algorithms: A Review Paper Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Part IV. Chapter 15 - Introduction to MIMD Architectures

Part IV. Chapter 15 - Introduction to MIMD Architectures D. Sima, T. J. Fountain, P. Kacsuk dvanced Computer rchitectures Part IV. Chapter 15 - Introduction to MIMD rchitectures Thread and process-level parallel architectures are typically realised by MIMD (Multiple

More information

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2 Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,

More information

Algorithms and Applications

Algorithms and Applications Algorithms and Applications 1 Areas done in textbook: Sorting Algorithms Numerical Algorithms Image Processing Searching and Optimization 2 Chapter 10 Sorting Algorithms - rearranging a list of numbers

More information

Towards an Adaptive, Fully Automated Performance Modeling Methodology for Cloud Applications

Towards an Adaptive, Fully Automated Performance Modeling Methodology for Cloud Applications Towards an Adaptive, Fully Automated Performance Modeling Methodology for Cloud Applications Ioannis Giannakopoulos 1, Dimitrios Tsoumakos 2 and Nectarios Koziris 1 1:Computing Systems Laboratory, School

More information

Task Allocation for Minimizing Programs Completion Time in Multicomputer Systems

Task Allocation for Minimizing Programs Completion Time in Multicomputer Systems Task Allocation for Minimizing Programs Completion Time in Multicomputer Systems Gamal Attiya and Yskandar Hamam Groupe ESIEE Paris, Lab. A 2 SI Cité Descartes, BP 99, 93162 Noisy-Le-Grand, FRANCE {attiyag,hamamy}@esiee.fr

More information

Client Server & Distributed System. A Basic Introduction

Client Server & Distributed System. A Basic Introduction Client Server & Distributed System A Basic Introduction 1 Client Server Architecture A network architecture in which each computer or process on the network is either a client or a server. Source: http://webopedia.lycos.com

More information

An Esterel Virtual Machine

An Esterel Virtual Machine An Esterel Virtual Machine Stephen A. Edwards Columbia University Octopi Workshop Chalmers University of Technology Gothenburg, Sweden December 28 An Esterel Virtual Machine Goal: Run big Esterel programs

More information

Parallel Algorithms for the Third Extension of the Sieve of Eratosthenes. Todd A. Whittaker Ohio State University

Parallel Algorithms for the Third Extension of the Sieve of Eratosthenes. Todd A. Whittaker Ohio State University Parallel Algorithms for the Third Extension of the Sieve of Eratosthenes Todd A. Whittaker Ohio State University whittake@cis.ohio-state.edu Kathy J. Liszka The University of Akron liszka@computer.org

More information

Online Facility Location

Online Facility Location Online Facility Location Adam Meyerson Abstract We consider the online variant of facility location, in which demand points arrive one at a time and we must maintain a set of facilities to service these

More information

Multiprocessor and Real-Time Scheduling. Chapter 10

Multiprocessor and Real-Time Scheduling. Chapter 10 Multiprocessor and Real-Time Scheduling Chapter 10 1 Roadmap Multiprocessor Scheduling Real-Time Scheduling Linux Scheduling Unix SVR4 Scheduling Windows Scheduling Classifications of Multiprocessor Systems

More information

SFU CMPT Lecture: Week 8

SFU CMPT Lecture: Week 8 SFU CMPT-307 2008-2 1 Lecture: Week 8 SFU CMPT-307 2008-2 Lecture: Week 8 Ján Maňuch E-mail: jmanuch@sfu.ca Lecture on June 24, 2008, 5.30pm-8.20pm SFU CMPT-307 2008-2 2 Lecture: Week 8 Universal hashing

More information

Why Multiprocessors?

Why Multiprocessors? Why Multiprocessors? Motivation: Go beyond the performance offered by a single processor Without requiring specialized processors Without the complexity of too much multiple issue Opportunity: Software

More information

Some aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE)

Some aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE) Some aspects of parallel program design R. Bader (LRZ) G. Hager (RRZE) Finding exploitable concurrency Problem analysis 1. Decompose into subproblems perhaps even hierarchy of subproblems that can simultaneously

More information

Parallelization Strategy

Parallelization Strategy COSC 6374 Parallel Computation Algorithm structure Spring 2008 Parallelization Strategy Finding Concurrency Structure the problem to expose exploitable concurrency Algorithm Structure Supporting Structure

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

Simplified design flow for embedded systems

Simplified design flow for embedded systems Simplified design flow for embedded systems 2005/12/02-1- Reuse of standard software components Knowledge from previous designs to be made available in the form of intellectual property (IP, for SW & HW).

More information

Using USB Hot-Plug For UMTS Short Message Service. Technical Brief from Missing Link Electronics:

Using USB Hot-Plug For UMTS Short Message Service. Technical Brief from Missing Link Electronics: Technical Brief 20100507 from Missing Link Electronics: Using USB Hot-Plug For UMTS Short Message Service This Technical Brief describes how the USB hot-plug capabilities of the MLE Soft Hardware Platform

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information