CASCH: A Software Tool for Automatic Parallelization and Scheduling of Programs on Message-Passing Multiprocessors
Ishfaq Ahmad 1, Yu-Kwong Kwok 2, Min-You Wu 3, and Wei Shu 3

1 Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
2 Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong
3 Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL, USA

iahmad@cs.ust.hk, ykwok@eee.hku.hk, {wu, shu}@eee.engr.ucf.edu

Corresponding Author: Ishfaq Ahmad
Revised: May 1999

Abstract

Existing parallel machines provide tremendous potential for high performance, but programming them can be a cumbersome and error-prone process. This process is multi-phase in nature: it consists of designing an appropriate parallel algorithm for the application at hand, implementing the algorithm by partitioning control and data, scheduling and mapping the partitioned program onto the processors, orchestrating communication and synchronization, and identifying and interpreting various performance measures. A number of these phases, such as scheduling, mapping, and communication, can be very tedious if done manually and thus should be done automatically. On the other hand, some of the more complex phases, such as parallelization, are better done semi-automatically. Software tools providing automatic functionalities free programmers from the nuisance of manual labor and can ensure better performance through code restructuring and optimization. This paper describes an experimental software tool called CASCH (Computer Aided SCHeduling) for parallelizing and scheduling applications on message-passing multiprocessors. CASCH transforms a sequential program into a parallel program with automatic scheduling, mapping, communication, and synchronization.
The major strength of CASCH is its extensive library of scheduling and mapping algorithms, representing a broad range of state-of-the-art work reported in the recent literature. These algorithms can be interactively analyzed, tested, and compared using real data on a common platform with various performance objectives, enabling the programmer to select the most suitable algorithm for the application. With its graphical interface, CASCH can benefit both novice and expert programmers of parallel machines, and can also serve as a teaching and learning aid for understanding scheduling and mapping algorithms. Keywords: Automatic parallelization, scheduling, parallel programs, task graphs, message-passing multiprocessors, software tools.
1 Introduction

Automated parallel programming environments are highly desirable to the programmers of parallel computers. Software tools embedded in a parallel programming environment can carry out a number of tasks, such as interprocessor communication and proper scheduling, freeing an average programmer from the major hurdles of parallelization and potentially improving performance. Since these tasks can be quite tedious if done manually, the availability of automated tools is useful for experienced programmers as well. Even though a large body of literature exists in the area of scheduling and mapping [1], [6], [7] (see Sidebar 1), only a small portion of this knowledge has been exploited for practical purposes. Some software tools supporting automatic scheduling and mapping have been proposed, but their main function is to provide a simulation environment [5]. While they can help in understanding the operation and behavior of scheduling and mapping algorithms, they are inadequate for practical purposes. On the other end of the spectrum, a large number of parallelizing tools (see Sidebar 2) have been proposed, but they are usually not well integrated with sophisticated scheduling algorithms. This paper describes a software tool called CASCH (Computer Aided SCHeduling) for parallel processing on distributed-memory multiprocessors. CASCH aims to be a complete parallel programming environment, including parallelization, partitioning, scheduling, mapping, communication, synchronization, code generation, and performance evaluation. Program parallelization is performed by a compiler that automatically converts sequential applications into parallel code. The parallel code to be executed on a target machine is optimized through proper scheduling and mapping. CASCH is a unique tool in that it provides all of the important ingredients for developing parallel programs.
It is useful for a novice programmer since its parallelization and code generation are done automatically. It can also help an experienced researcher since it provides various facilities to fine-tune and optimize a program. CASCH includes an extensive library of state-of-the-art scheduling algorithms from the recent literature. The library is organized into different categories that are suitable for different architectural environments. The user can select one of these algorithms for scheduling the task graph generated from the application. The weights on the nodes and edges of the task graph are inserted using a database that contains the timings of various computation, communication, and I/O operations for different machines. These timings have been obtained through benchmarking. An attractive feature of CASCH is its graphical user interface, which provides a flexible and easy-to-use interactive environment for analyzing various scheduling and mapping algorithms, using task graphs generated randomly, interactively, or directly from real programs. The best schedule can be selected and used by the code generator to generate parallel code for a particular machine; the same process can be repeated for another machine. CASCH can also be used as a teaching aid for learning scheduling and mapping algorithms, since it allows the interactive creation of task graphs and machine topologies, and provides a trace of a schedule which permits the identification of the order in which tasks are scheduled by a particular algorithm.
The rest of this paper is organized as follows. Section 2 gives an overview of CASCH and describes its major functionalities. Section 3 includes the results of experiments conducted on an Intel Paragon and an IBM SP2 using CASCH. The last section includes a discussion of future work and some concluding remarks. The first sidebar is a survey of scheduling algorithms and includes a taxonomy to classify the algorithms. The second sidebar discusses related work and provides an overview of reported programming and scheduling tools.

2 Overview of CASCH

The overall organization of CASCH is shown in Figure 1. The main components of CASCH are: a compiler, which includes a lexical analyzer and a parser; a task graph generator; a weight estimator; a scheduling/mapping module; a communication inserter; an interactive user interface; a code generator; and a performance evaluation module.

Figure 1: The various components and functionalities of CASCH.
These components are described below.

2.1 User Programs

Using the CASCH tool, the user first writes a sequential program from which a task graph is generated. To facilitate the automation of program development, the sequential program is composed of a set of procedures called from the main program. A procedure, which should be written using the single-assignment rule, is an indivisible unit of computation to be scheduled on one processor. The grain sizes of procedures are determined by the programmer, and can be modified. Figure 2 shows an example (an implementation of a fast Fourier transform algorithm) in which the data matrix is partitioned by columns across processors. In the serial program, the constant N = PN x SN determines the problem size. Specifically, the constants PN and SN control the granularity of the partitioning: the larger the value of SN, the higher the granularity. In the current implementation of CASCH, these constants are defined by the user at compile-time. The procedures InterMult and IntraMult are called several times. The control dependencies can be ignored, so that a procedure call can be executed whenever all input data of the procedure are available. Data dependencies are defined by the single assignment of parameters in procedure calls. Communications can be invoked only at the beginning and the end of procedures. In other words, a procedure receives messages before it begins execution, and sends messages after it has finished execution. In general, the user is required to implement the application (e.g., FFT) only in the form of a sequential program consisting of a set of procedures. The sequential program is basically an ordinary C program, except that the user has to insert a few annotations in the form of #define compiler directives which instruct CASCH to invoke the partitioning of data arrays. For instance, in the FFT example, the user just needs to define the values of PN and SN in the header of the sequential C program.
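The effect of PN and SN on granularity can be sketched as follows. This is a hypothetical illustration in Python, not CASCH's actual implementation; the function name and the illustrative values PN = 4 and SN = 2 (so N = 8) are our own choices.

```python
# Hypothetical sketch (not CASCH code): partitioning an N-point data set
# into PN contiguous column blocks of SN points each, where N = PN x SN.
# Larger SN means fewer, coarser tasks, i.e., higher granularity.

def partition_columns(data, pn):
    """Split `data` into pn contiguous blocks; each block is one task's data."""
    n = len(data)
    assert n % pn == 0, "problem size must be divisible by the partition count"
    sn = n // pn  # points per block: N = PN x SN
    return [data[i * sn:(i + 1) * sn] for i in range(pn)]

points = list(range(8))                    # N = 8
blocks = partition_columns(points, pn=4)   # PN = 4, hence SN = 2
print(blocks)                              # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Each block becomes the data of one procedure call, so raising SN (and lowering PN) trades parallelism for lower scheduling and communication overhead.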
In this example, PN = 4 and SN = 2.

2.2 Lexical Analyzer and Parser

The lexical analyzer and parser analyze the data dependencies and user-defined partitions. In our implementation of CASCH, both components are constructed with the help of lex and yacc. If a syntax or semantic error is discovered in this stage, the user is advised to fix the problem before proceeding to the task graph generation phase. For a static program, the number of procedures is known before program execution. Many numerical applications belong to this static class of programs [8]. Such a program is system independent since communication primitives are not specified in the program. Data dependencies among the procedural parameters define a macro dataflow graph (i.e., the task graph) [10].

2.3 Task Graph Generation

A macro dataflow graph, which is generated directly from the main program, is a directed acyclic graph (DAG) with start and end points. A macro dataflow graph consists of a set of tasks {T_1, T_2, ..., T_n} and a set of edges {e_1, e_2, ..., e_m} such that e_k = (T_i, T_j) for 1 <= k <= m and some i, j where 1 <= i, j <= n.
Each node in the graph corresponds to a procedure or a task, and the node weight is the procedure's execution time. Each edge corresponds to a message transferred from one procedure to another, and the weight of the edge is equal to the transmission time of the message. When two tasks are scheduled to a single processor, the weight of the edge connecting them becomes zero.

2.4 Weight Estimator

The weights on the nodes and edges of the task graph are inserted with the help of an estimator that provides the execution-time costs of various instructions as well as the cost of communication on a given machine. These timings have been obtained through benchmarking, using an approach similar to [2], [4]. Communication estimation, which is obtained experimentally, is based on the cost of each communication primitive, such as send, receive, and broadcast. Our approach is similar to that used by Xu and Hwang [11]. Table 1 shows the communication times (assuming a stand-alone mode) for various target machines.

Table 1: Communication timing constants (microseconds) for various target machines.
Machine        | Start-up | Rate per byte | 1/ClockRate
Intel Paragon  |          |               |
IBM SP2        |          |               |

The current version of the computation estimator is a symbolic estimator. The estimation is based on reading through the code without running it. Its symbolic output is a function of the input parameters of the code. With a symbolic estimator and a restricted class of C codes, the code does not need re-estimation for different problem sizes. The code may include functions and procedures, and the estimator generates a performance estimate for each of them. The code may have for loops, whose bounds can be either constants or input parameters. The cost of each operation or built-in function is specified in the cost files. The total amount of computation can be obtained by summing all costs of operations and functions for a segment of code.
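The edge weights above can be sketched with the usual linear communication-cost model (start-up cost plus a per-byte rate, as in Table 1), with the cost dropping to zero for co-located tasks. This is an illustrative Python sketch, not CASCH's internals; the STARTUP and RATE values are invented, not the benchmarked constants of Table 1.

```python
# Hypothetical sketch (not CASCH's internals): edge weights of the macro
# dataflow graph under a linear cost model: startup + rate * bytes.
# An edge between two tasks scheduled on the same processor costs zero.

STARTUP = 50.0   # assumed start-up latency (microseconds), invented value
RATE = 0.5       # assumed cost per byte (microseconds), invented value

def edge_cost(msg_bytes, proc_src, proc_dst):
    """Transmission time of a message; zero if the tasks are co-located."""
    if proc_src == proc_dst:
        return 0.0
    return STARTUP + RATE * msg_bytes

# A 1000-byte message between different processors costs 50 + 0.5*1000 = 550 us;
# between tasks on the same processor it costs nothing.
print(edge_cost(1000, "P0", "P1"))   # 550.0
print(edge_cost(1000, "P0", "P0"))   # 0.0
```

Node weights are obtained analogously by summing per-operation costs from the cost files over the procedure body, as described for the symbolic estimator above.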
2.5 Task Scheduling and Mapping

A common approach to distributing the workload among p processors is to partition a problem into p tasks and perform a one-to-one mapping between the tasks and the processors. (We use the terms node and task synonymously.) Partitioning can be done with a block, cyclic, or block-cyclic pattern [10]. Such partitioning schemes, combined with simple scheduling heuristics such as the owner-computes rule, work for certain problems but can fail for many others, especially irregular problems, since it is difficult to balance the load and minimize dependencies simultaneously. An irregular problem is instead partitioned into many tasks, which are then scheduled so as to balance the load and minimize communication. In CASCH, the task graph generated from this partitioning is scheduled using a scheduling algorithm. Since one scheduling algorithm may not be suitable for every problem on a given architecture [7], CASCH includes various algorithms which are
Program FFT
/* N: number of points for the discrete Fourier transform; let N = PN x SN */
/* data[log(PN)+2][PN][SN] stores single-assigned data points for the discrete */
/* Fourier transform, organized as a PN x SN grid for parallel computation */

/* main program */
call Initiation;  /* serial part; initialize the array `data' */

/* parallel inter-multiplication of data points */
for i = log(PN) downto 1 do
  for j = 0 to PN-1 step 1<<i do
    for k = 0 to 1<<(i-1)-1 do
      call InterMult(data[i+1][j+k], data[i+1][j+k+1<<(i-1)], data[i][j+k], SN);
      call InterMult(data[i+1][j+k+1<<(i-1)], data[i+1][j+k], data[i][j+k+1<<(i-1)], SN);
      /* in each iteration, InterMult can be executed if arrays data[i+1][j+k] */
      /* and data[i+1][j+k+1<<(i-1)] are available; upon completion, data[i][j+k] */
      /* and data[i][j+k+1<<(i-1)] will be available */
    endfor
  endfor
endfor

/* parallel intra-multiplication of data points */
for i = 0 to PN-1 do
  call IntraMult(data[1][i], data[0][i], SN);
  /* in each iteration, IntraMult can be executed if array data[1][i] is */
  /* available; upon completion, data[0][i], which is the result, will be available */
endfor

call OutputResult;  /* serial part; inverse and return results */
EndProgram FFT

/* Procedure InterMult */
Procedure InterMult(inarray1, inarray2, outarray, n)
/* Input:  inarray1, inarray2  data points for multiplication */
/*         n                   number of data points in sub-array */
/* Output: outarray            array of output data */
for i = 0 to n-1 do
  outarray[i] = inarray1[i] @ inarray2[i];  /* '@' is the element-wise complex FFT operation */
endfor
EndProcedure InterMult

/* Procedure IntraMult */
Procedure IntraMult(inarray, outarray, n)
/* Input:  inarray  data points for multiplication */
/*         n        number of data points in sub-array */
/* Output: outarray array of output data */
for i = log(n) downto 1 do
  for j = 0 to n-1 step 1<<i do
    for k = 0 to 1<<(i-1)-1 do
      outarray[j+k] = inarray[j+k] @ inarray[j+k+1<<(i-1)];
      outarray[j+k+1<<(i-1)] = inarray[j+k+1<<(i-1)] @ inarray[j+k];
      /* '@' is the element-wise complex FFT operation */
    endfor
  endfor
  for j = 0 to n-1 do
    inarray[j] = outarray[j];
  endfor
endfor
EndProcedure IntraMult

Figure 2: A sequential program for fast Fourier transform.

suitable to various environments. The advantages of having a wide variety of algorithms in CASCH are as follows: the diversity of these heuristic algorithms allows the user to select a type of algorithm that is suitable to a particular architectural configuration.
The common platform provided by CASCH allows simultaneous comparisons among various algorithms, based on a number of performance objectives such as schedule length, number of processors used, and the algorithm's running time. The comparison among the algorithms can be done using manually generated task graphs as well as real data measured at execution time for a number of applications. For a given application program, the user can optimize the code by running various scheduling algorithms and then choosing the best schedule.

2.6 Communication Inserter

Synchronization among the tasks running on multiple processors is carried out by communication primitives. The basic communication primitives for exchanging messages between processors are send and receive. They must be used properly to ensure a correct sequence of computation. These primitives can be inserted automatically, reducing the programmer's burden and eliminating insertion errors. The procedure for inserting communication primitives is as follows. After scheduling and mapping, each task has been allocated to a processor. If an edge leaves a task for another task on a different processor, a send primitive is inserted after the task. Similarly, if an edge comes from a task on a different processor, a receive primitive is inserted before the task. The insertion method described above does not by itself ensure a correct communication sequence, because a deadlock may occur. Thus, we use a send-first strategy to reorder the communication primitives; that is, we reorder receives according to the order of sends. The communication primitive insertion algorithm is described below.

Communication Insertion Algorithm: Assume that after scheduling and mapping, each task T_i of the task graph is allocated to processor M(T_i), where M is a function mapping a task to a processor.
(1) For each edge e_k from task T_i to T_j for which M(T_i) != M(T_j), insert a send primitive after task T_i in processor M(T_i), denoted by S(e_k, T_i, M(T_j)); insert a receive primitive before task T_j in processor M(T_j), denoted by R(e_k, T_j, M(T_i)). Once a message has been scheduled to be sent to a processor, eliminate other sends and receives that transfer the same message to the same processor. Now, for each processor, we have a sequence X(e_m1, T_m1, P_m1), X(e_m2, T_m2, P_m2), ..., where X could be either S or R.

(2) For each pair of processors, say P_1 and P_2, extract all S(e_mi, T_mi, P_2) from processor P_1 to form a subsequence S_P1, and extract all R(e_mj, T_mj, P_1) from processor P_2 to form a subsequence R_P2.
Step 2.1: Within each segment of the subsequence S_P1 with the same task number, exchange the order of the sends according to the order of the receives as defined by the subsequence R_P2.
Step 2.2: If the two resultant subsequences still do not match each other, R_P2 is reordered according to the order of S_P1.

2.7 Code Generation

We use the example of Figure 2 to illustrate our code generation method. Figure 3 shows the generated
parallel code for three processors (assuming N = 8). Note that only the main program for each processor is shown. The data structure is the same as in Figure 2. In this example, the initial data is stored at processor P0. Data is transmitted to the other processors such that each processor obtains only the portion of data required for its computation; consequently, the memory space is compacted. To reduce the number of message transfers and, consequently, the time spent initiating messages, several messages can be packed and sent together. For example, the first four messages can be packed into one message and sent to processor P0. Such optimizations are also implemented in CASCH. Finally, the fourth data partition of the result is received from processor P0, the third from processor P1, and the first and the second from processor P2.

/* For P0 */
/* load array of data points from HOST */
receive(host, data[3][0]);
receive(host, data[3][1]);
receive(host, data[3][2]);
receive(host, data[3][3]);
InterMult(data[3][3], data[3][1], data[2][3], 2);
send(p1, data[2][3]);
InterMult(data[3][1], data[3][3], data[2][1], 2);
InterMult(data[3][2], data[3][0], data[2][2], 2);
send(p1, data[2][2]);
InterMult(data[3][0], data[3][2], data[2][0], 2);
InterMult(data[2][1], data[2][0], data[1][1], 2);
send(p2, data[1][1]);
InterMult(data[2][0], data[2][1], data[1][0], 2);
send(p2, data[1][0]);
InterMult(data[2][3], data[2][2], data[1][3], 2);
IntraMult(data[1][3], data[0][3], 2);
/* unload result array of data points to HOST */
send(host, data[0][3]);

/* For P1 */
receive(p0, data[2][2]);
receive(p0, data[2][3]);
InterMult(data[2][2], data[2][3], data[1][2], 2);
IntraMult(data[1][2], data[0][2], 2);
/* unload result array of data points to HOST */
send(host, data[0][2]);

/* For P2 */
receive(p0, data[1][1]);
IntraMult(data[1][1], data[0][1], 2);
receive(p0, data[1][0]);
IntraMult(data[1][0], data[0][0], 2);
/* unload result array of data points to HOST */
send(host, data[0][1]);
send(host, data[0][0]);

Figure 3: The parallel code for fast Fourier transform.

2.8 Graphical User Interface

The graphical capabilities of CASCH provide the user with an easy-to-use, window-based interactive interface. The graphical interface includes the following facilities, which map to the buttons shown in Figure 4.
Figure 4: The main menu of CASCH.

Source: The user can create, edit, or browse through sequential programs. The source button also includes a sub-menu for generating a task graph from the user program.

DAGs: This includes a facility to display a task graph (i.e., a DAG) generated from the user program (Figure 5 shows the DAG for the FFT program). Other options include the display of a randomly generated DAG or the interactive creation of a DAG. Zooming facilities (horizontal, vertical, or both) are included for proper viewing.

Figure 5: Display of the DAG for the FFT program.

TIGs: This facility displays task graphs that are TIGs (task interaction graphs, with undirected edges). This facility is similar in functionality to DAGs.

Processor Network: This facility allows the user to display a processor architecture (including the processors and the network topology). The editing facilities, similar to those for DAGs, allow the user to interactively create various network topologies. An example processor graph is illustrated in Figure 6.

Scheduling: This facility includes a sub-menu from which the user must first select one of the three classes of scheduling algorithms, i.e., BNP, UNC, and APN. Within each class, the user needs to
Figure 6: Display of processor graph.

select one of the scheduling algorithms. The scheduling algorithm requires the user to enter a number of parameters.

Show Schedule: The schedule generated as the result of invoking a scheduling algorithm can be displayed using this facility (Figure 7 and Figure 8 show the scheduling of the FFT example by five different algorithms). A schedule is displayed using a Gantt chart showing the start and finish times of the tasks on the various processors. Clicking on any task in the Gantt chart displays its start and finish times; the total schedule length is shown in the right corner of the window. A schedule also includes the communication messages on the network (displayed through another window, which is invoked by clicking on any two processors). An important feature of this facility is the trace option, which shows the step-by-step scheduling of each task. This is very useful for understanding the operation of a scheduling algorithm through observation of the order in which tasks are scheduled by the algorithm. Multiple such charts can be opened concurrently, allowing a comparison among the schedules generated by various algorithms. Indeed, in most cases, it may be necessary to try different algorithms. Two additional scheduling examples are depicted in Figure 9 and Figure 10.

Mapping: This set of facilities includes a number of mapping algorithms that are used to map TIGs onto the processors. At present, CASCH includes algorithms based on A*, recursive clustering, and simulated annealing [9]. Some scheduling algorithms (such as the UNC algorithms) may first generate clusters that need to be mapped onto the processors using one of these mapping algorithms.

Show Mapping: This shows an assignment of tasks to the processors generated by a mapping algorithm. The display includes a TIG in which a processor number is attached to each task (indicating the processor to which that task is allocated).
Code Generation: The code generator generates the parallel code for a given program according to a schedule/map generated by a scheduling/mapping algorithm.

Performance: The performance facilities include processor utilization, time spent in computation

(For definitions of these terms, see the sidebar "Recent Research in Multiprocessor Scheduling: A Brief Introduction." Tasks 1 and 14 are shown as thin rectangles due to their small weights. A mapping algorithm is required if scheduling and mapping are done separately.)
(a) The schedule generated by the DCP algorithm (schedule length = 71).
(b) The schedule generated by the DSC algorithm (schedule length = 68).
(c) The schedule generated by the MD algorithm (schedule length = 78).
Figure 7: Display of the Gantt charts showing the schedules generated by three UNC algorithms for the FFT program.
Figure 8: Display of the Gantt chart showing the schedules generated by the MCP and ETF algorithms for the FFT program (schedule length = 83).

Figure 9: A Gaussian elimination task graph.

and communication, and speedup. The computation and communication timing results are obtained by inserting the dclock() procedure call before and after each inter-task communication.

Data Partitioning: This facility includes tools for displaying structured and unstructured meshes as well as the partitioning of data across different processors.
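The total schedule length that the Gantt chart facility reports is simply the latest finish time over all processors. A minimal Python sketch (with invented task times, not CASCH output) illustrates the computation:

```python
# Hypothetical sketch: computing the schedule length (makespan) displayed by
# a Gantt chart, from per-processor lists of (task, start, finish) tuples.
# The task names and times below are invented for illustration.

def schedule_length(gantt):
    """Makespan = the latest finish time over all processors."""
    return max(finish for tasks in gantt.values() for (_, _, finish) in tasks)

gantt = {
    "P0": [("T1", 0, 10), ("T3", 15, 40)],
    "P1": [("T2", 0, 20), ("T4", 25, 68)],
}
print(schedule_length(gantt))   # 68
```

Comparing this single number across the concurrently opened charts is what lets the user rank the schedules produced by different algorithms.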
(a) The schedule generated by the DCP algorithm (schedule length = 439).
(b) The schedule generated by the MCP and ETF algorithms (schedule length = 431).
Figure 10: The schedules for the Gaussian elimination graph (Figure 9) produced by two scheduling algorithms.

3 Results

CASCH runs on a Sun workstation that is linked through a network to an Intel Paragon and an IBM SP2. We have parallelized several applications with CASCH using several of the scheduling algorithms described above. Here we discuss some preliminary results obtained by measuring the performance of three applications: an FFT, a Laplace equation solver, and the N-body problem. These results demonstrate the viability and usefulness of CASCH and allow a comparison among various scheduling algorithms. For reference, the results obtained with code generated by random scheduling of tasks are included. Such target code is generated by first partitioning the data among the processors in a fashion that reduces the dependencies among the partitions. Based on this partitioning, SPMD-style code is generated by randomly allocating the tasks to the processors. Hereafter, randsch denotes the results of these randomly scheduled programs.

The first set of results (see Table 2) is for the FFT example shown earlier with four different sizes of
input data: 512, 1024, 2048, and 4096 points. Table 2 shows the execution times for the various data sizes using the different scheduling algorithms. Each value is an average of ten runs on the Intel Paragon and IBM SP2. The Paragon consists of 140 i860/XP processors with a clock speed of 50 MHz, while the SP2 consists of IBM P2SC processors.

Table 2: Execution times of the FFT application for all the scheduling algorithms on the Intel Paragon and IBM SP2.

We observe that the execution times vary considerably across the algorithms. Among the UNC algorithms, the DCP algorithm yields the best performance due to its superior scheduling method; it also yields the best performance overall. Among the BNP algorithms, MCP is in general better, primarily because of its better task priority assignment method. Among the APN algorithms, BSA and MH perform better, due to their proper allocation of tasks and messages. All algorithms perform better than randsch: compared to random scheduling, the performance improvement is up to 400%. Our second application is based on a Gauss-Seidel algorithm for solving Laplace equations. The four matrix sizes used are 8, 16, 32, and 64. The application execution times using the various algorithms and data sizes are shown in Table 3. Again, using the DCP algorithm, more than 400% improvement over randsch is obtained. The UNC algorithms in general yield better schedules. The third application is the N-body problem. The execution-time results are shown in Table 4. Again, the scheduling algorithms exhibit a similar trend in application execution times on both parallel machines as in the previous two applications. The running times of the scheduling algorithms for the three applications are shown in Table 5.
Here, we can see that some scheduling algorithms take much longer than others due to their higher complexities (for details about the complexities of the algorithms, the reader is referred to [7]). Thus, there is a trade-off between the performance and the speed of a scheduling algorithm. For example, the DCP algorithm can generate better solutions than the DSC algorithm, but it is slower.
Table 3: Execution times of the Laplace equation solver application for all the scheduling algorithms on the Intel Paragon and IBM SP2.

Table 4: Execution times of the N-Body application for all the scheduling algorithms on the Intel Paragon and IBM SP2.

One important point to note from the above preliminary experimental results is that the performance of the scheduling algorithms can vary substantially. For instance, even though the average performance of the DCP algorithm is the best overall, it performs worse in some cases. Thus, the user may have to try different schedulers to obtain the best results for the application at hand.
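Trying different schedulers and keeping the best amounts to a simple minimization over the measured execution times. The following Python sketch is a hypothetical illustration of this selection step as a CASCH user would perform it interactively; the timing values are invented, not taken from Tables 2-4.

```python
# Hypothetical sketch: choosing the best scheduling algorithm for an
# application by comparing measured execution times.  The timings below
# are invented for illustration; they are not the paper's measurements.

def best_algorithm(timings):
    """Return the (algorithm, time) pair with the smallest execution time."""
    return min(timings.items(), key=lambda kv: kv[1])

# invented execution times (seconds) for one application on one machine
timings = {"DCP": 1.2, "DSC": 1.5, "MCP": 1.3, "randsch": 5.1}
algo, t = best_algorithm(timings)
print(algo, t)          # DCP 1.2

# relative improvement of the chosen schedule over random scheduling
improvement = (timings["randsch"] - t) / t * 100
print(round(improvement))   # 325 (percent)
```

Since no single heuristic wins in every case, this per-application selection is exactly where CASCH's common platform pays off.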
Table 5: Scheduling times (seconds) for the applications on a SPARCstation 2 for all the scheduling algorithms: (a) FFT application; (b) Laplace equation solver application; (c) N-Body application.

4 Conclusions and Future Work

CASCH achieves the objectives of automatic parallelization and scheduling of applications by providing a unified environment for various existing and conceptual machines. The combination of a parallel code generator and a scheduler makes it possible to test new scheduling ideas with real applications instead of just simulations. CASCH also provides a framework for users to compare various scheduling algorithms and to optimize their code by choosing the best algorithm. As shown in the experimental results, even an effective scheduling heuristic can sometimes produce inferior solutions. The extensive scheduling algorithms
library in CASCH, which includes various state-of-the-art scheduling algorithms, allows the user to optimize the execution of a parallel application by choosing the best schedule with the help of the interactive scheduling interface.

We are currently working on extending the capabilities of CASCH by including the following:

- Support for distributed computing systems, such as a collection of diverse machines working as a distributed heterogeneous supercomputer;
- Extension of the current database of benchmark timings to include more detailed and lower-level timings of the various computation, communication, and I/O operations of existing machines;
- Inclusion of debugging facilities for error detection, global variable checking, etc.;
- Design and implementation of partitioners for automatic or interactive partitioning of programs;
- Design of an intelligent tool that selects an appropriate scheduling algorithm for a given application; and
- Enhancement of the task graph module so that huge task graphs (e.g., for the Laplace equation solver with a larger matrix size) can be handled; the parameterized task graph (PTG) technique proposed by Cosnard and Loi [3] is being considered for this purpose.

Acknowledgments

The authors would like to thank the referees for their constructive and insightful comments, which have greatly improved the presentation of this paper. Preliminary versions of portions of this paper were presented at the 1997 International Conference on Parallel Processing and the 3rd European Conference on Parallel Processing. This research was supported by the Hong Kong Research Grants Council under contract numbers HKUST RI 93/94.EG06 and HKUST734/97E.

References

[1] E.G. Coffman, Computer and Job-Shop Scheduling Theory, Wiley, New York, 1976.
[2] W.W. Chu, M.-T. Lan, and J. Hellerstein, Estimation of Intermodule Communication (IMC) and Its Applications in Distributed Processing Systems, IEEE Trans. Computers, vol. C-33, no. 8, Aug. 1984.
[3] M.
Cosnard and M. Loi, Automatic Task Graphs Generation Techniques, Parallel Processing Letters, vol. 5, no. 4, Dec. 1995.
[4] A. Ghafoor and J. Yang, A Distributed Heterogeneous Supercomputing Management System, IEEE Computer, vol. 26, no. 6, June 1993.
[5] S.J. Kim and J.C. Browne, A General Approach to Mapping of Parallel Computation upon Multiprocessor Architectures, Proc. Int'l Conf. Parallel Processing, vol. II, pp. 1-8, Aug. 1988.
[6] Y.-K. Kwok and I. Ahmad, Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs onto Multiprocessors, IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 5, May 1996.
[7] Y.-K. Kwok and I. Ahmad, Benchmarking the Task Graph Scheduling Algorithms, Proc. Int'l Parallel Processing Symposium, Apr. 1998.
[8] R.E. Lord, J.S. Kowalik, and S.P. Kumar, Solving Linear Algebraic Equations on an MIMD Computer, J. ACM, vol. 30, no. 1, Jan. 1983.
[9] T.M. Nabhan and A.Y. Zomaya, A Parallel Simulated Annealing Algorithm with Low Communication Overhead, IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 12, pp.
1233, Dec. 1995.
[10] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, MIT Press, Cambridge, MA, 1989.
[11] Z. Xu and K. Hwang, Modeling Communication Overhead: MPI and MPL Performance on the IBM SP2, IEEE Parallel and Distributed Technology, vol. 4, no. 1, pp. 9-23, Spring 1996.

Sidebar 1: Recent Research in Multiprocessor Scheduling
A Brief Introduction

The most common models of a parallel program are the precedence-constrained directed acyclic graph (DAG) and the task interacting graph (TIG), which has no temporal dependencies. Figure 11(a) shows a parallel loop nest and Figure 11(b) depicts the DAG representing the loop.

(1) Data[0,j] := 0, for all j
(3) for i = 1 to 3 do in parallel
(4)   for j = 1 to 3-(i-1) do in parallel
(5)     Task[i,j] (INPUT: Data[i-1,j], Data[i-1,j+1]; OUTPUT: Data[i,j])

Figure 11: (a) A parallel program fragment and (b) a directed acyclic graph representing the program fragment. (The DAG's nodes are the tasks Task[1,1] through Task[3,1], and its edges carry the Data[i,j] values between dependent tasks.)

The weight associated with a node represents the execution time of the corresponding task, and the weight associated with an edge represents the communication time. Numerous techniques have been proposed in the literature for generating the node and edge weights off-line, such as execution profiling and analytical benchmarking [5]. With such a static model, a scheduler is invoked off-line at compile time. This form of the multiprocessor scheduling problem is called static scheduling or DAG scheduling.

Figure 12 provides a taxonomy of static parallel scheduling algorithms. The taxonomy is partial, since it does not include details of some of the earlier work on scheduling; only those scheduling algorithms that can be used in a realistic environment and are relevant in our context are considered. The taxonomy is hierarchical and develops by expanding each layer.
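The weighted-DAG model can be made concrete with a small sketch. The task and edge weights below are illustrative, not measured values; the bottom level computed here, i.e. the longest computation-plus-communication path from a task to an exit task, is a standard quantity used by several of the heuristics discussed in this sidebar, and its maximum is the critical-path length:

```python
# Minimal weighted-DAG model of a parallel program: node weights are
# task execution times, edge weights are communication times.
tasks = {"T1": 4, "T2": 3, "T3": 3, "T4": 5}             # node -> computation cost
edges = {("T1", "T2"): 2, ("T1", "T3"): 1,
         ("T2", "T4"): 2, ("T3", "T4"): 3}               # edge -> communication cost

def blevel(task, memo=None):
    """Bottom level: longest path (computation + communication) from
    `task` to an exit node. The largest b-level over all tasks is the
    critical-path length, a lower bound on any schedule."""
    if memo is None:
        memo = {}
    if task in memo:
        return memo[task]
    succs = [(v, w) for (u, v), w in edges.items() if u == task]
    memo[task] = tasks[task] + max((w + blevel(v, memo) for v, w in succs),
                                   default=0)
    return memo[task]

cp_length = max(blevel(t) for t in tasks)
print(cp_length)  # critical-path length of this toy DAG
```

Scheduling heuristics differ mainly in how they exploit such path information, e.g. prioritizing tasks by b-level or repeatedly zeroing edges on the current critical path.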
Thick arrows indicate the relevance to our discussion and a further division of a particular layer; thin arrows do not lead to a further division in the taxonomy. The highest level of the taxonomy is divided into two categories, depending upon whether or not the tasks are independent. This discussion is limited to dependent tasks. Earlier algorithms make simplifying assumptions about the task graph representing the program and the model of the multiprocessor system. Some algorithms ignore the precedence constraints and consider the task graph to be free of temporal dependencies (a task interacting graph). The algorithms considering the more realistic task precedence
constrained graph assume the graph to be of a special structure, such as a tree or fork-join graph. In general, however, parallel programs come in a variety of structures. The algorithms designed to tackle arbitrary graph structures can be divided further into two categories: some assume the computational costs of all the tasks to be the same, while others allow arbitrary computational costs. It is worth mentioning that the scheduling problem is NP-complete even in two simple cases: (1) scheduling unit-time tasks to an arbitrary number of processors, and (2) scheduling tasks of one or two time units to two processors. Scheduling with communication may be done with or without duplication of tasks [3]. Each class can be further divided into two categories. Note that only the division of the No-Duplication class is shown; an analogous division of the Duplication class can be envisaged but is omitted due to its similarity to the No-Duplication class. Some scheduling algorithms assume the availability of an unlimited number of processors [2], [6], [7], [9] with a fully connected network; these are called the UNC (unbounded number of clusters) scheduling algorithms. The algorithms assuming a limited number of processors are called the BNP (bounded number of processors) scheduling algorithms. In both the UNC and BNP scheduling algorithms, the processors are assumed to be fully connected, and link contention and the routing strategies used for communication are ignored. If scheduling and mapping are done in separate steps, the schedules generated by the UNC or BNP algorithms can be mapped onto the processors using the indirect mapping approach. The algorithms that assume the system to have an arbitrary network topology are called the APN (arbitrary processor network) scheduling algorithms [8]. The basis of a major component of scheduling algorithms (in all three classes) is the classical list scheduling approach [1], [4].
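A minimal sketch of this classical list scheduling loop on a fully connected machine, as assumed by the BNP class, follows. The graph, the static priorities, and the two-processor setting are illustrative assumptions, not any particular algorithm from the CASCH library:

```python
# List scheduling sketch: pick the highest-priority ready task and place
# it on the processor giving the earliest start time, accounting for
# communication when a predecessor sits on a different processor.
tasks = {"A": 2, "B": 3, "C": 3, "D": 2}                 # task -> execution time
edges = {("A", "B"): 1, ("A", "C"): 1,
         ("B", "D"): 1, ("C", "D"): 2}                   # edge -> communication time
priority = {"A": 4, "B": 3, "C": 2, "D": 1}              # e.g., static b-levels

def list_schedule(num_procs):
    preds = {t: [u for (u, v) in edges if v == t] for t in tasks}
    proc_free = [0] * num_procs                          # earliest free time per processor
    finish, where = {}, {}
    ready = [t for t in tasks if not preds[t]]
    while ready:
        ready.sort(key=lambda t: -priority[t])           # highest priority first
        task = ready.pop(0)
        best = (None, None)
        for p in range(num_procs):
            # data from a predecessor on another processor pays the edge cost
            data_ready = max((finish[u] + (edges[(u, task)] if where[u] != p else 0)
                              for u in preds[task]), default=0)
            start = max(proc_free[p], data_ready)
            if best[0] is None or start < best[1]:
                best = (p, start)
        p, start = best
        finish[task], where[task] = start + tasks[task], p
        proc_free[p] = finish[task]
        ready += [v for v in tasks if v not in finish and v not in ready
                  and all(u in finish for u in preds[v])]
    return max(finish.values())                          # schedule length

print(list_schedule(2))
```

On this toy graph two processors yield a schedule length of 8 versus 10 on one; the heuristics described below differ mainly in how the priorities are defined and whether they are recomputed dynamically.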
In list scheduling, the tasks are assigned priorities and placed in a ready list arranged in descending order of priority. A task with a higher priority is examined for scheduling before a task with a lower priority; if more than one task has the same priority, ties are broken using some method. A task selected for scheduling is allocated to the processor that allows the earliest start time. After a task is scheduled, more tasks may be added to the ready list, and the tasks in the ready list are again examined and scheduled. This continues until all tasks are scheduled. The scheduling algorithm library of CASCH includes five UNC, six BNP, and four APN algorithms. The major characteristics of these algorithms are briefly described below; the reader is referred to [2] for a more detailed description and comparison.

References for Sidebar 1:
[1] T.L. Adam, K.M. Chandy, and J. Dickson, A Comparison of List Scheduling for Parallel Processing Systems, Comm. ACM, vol. 17, no. 12, Dec. 1974.
[2] I. Ahmad, Y.-K. Kwok, and M.-Y. Wu, Analysis, Evaluation, and Comparison of Algorithms for Scheduling Task Graphs on Parallel Processors, Proc. 2nd Int'l Symposium on Parallel Architecture, Algorithms, and Networks, June 1996.
[3] I. Ahmad and Y.-K. Kwok, On Exploiting Task Duplication in Parallel Program Scheduling, IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 9, Sept. 1998.
Figure 12: A partial taxonomy of the multiprocessor scheduling problem. (Static parallel scheduling divides into independent tasks and multiple interacting tasks; the latter into the task interaction graph and task precedence graph models; precedence graphs into restricted and arbitrary graph structures; and then by unit versus arbitrary computational costs, no communication versus communication, duplication versus no duplication, and unlimited versus limited numbers of processors, with fully connected processors giving the UNC and BNP classes and arbitrarily connected processors the APN class.)

[4] H. El-Rewini and T.G. Lewis, Scheduling Parallel Programs onto Arbitrary Target Machines, J. Parallel and Distributed Computing, vol. 9, no. 2, June 1990.
[5] K. Hwang, Z. Xu, and M. Arakawa, Benchmark Evaluation of the IBM SP2 for Parallel Signal Processing, IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 5, May 1996.
[6] S.J. Kim and J.C. Browne, A General Approach to Mapping of Parallel Computation upon Multiprocessor Architectures, Proc. Int'l Conf. Parallel Processing, vol. II, pp. 1-8, Aug. 1988.
[7] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, MIT Press, Cambridge, MA, 1989.
[8] G.C. Sih and E.A. Lee, A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures, IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 2, Feb. 1993.
[9] T. Yang and A. Gerasoulis, DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors, IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 9, Sept. 1994.

Sidebar 2: Other Parallel Programming Tools

Several research efforts have demonstrated the usefulness of program development tools for parallel processing on message-passing multiprocessors. Essentially, these tools fall into two classes. The first
class, which mainly comprises commercial tools, provides software development and debugging environments; the ATEXPERT by Cray Research [1] is an example. Some of these tools also provide performance tuning and other program development facilities. The second class performs program transformation through restructuring. PARASCOPE [3] and TINY [9] are restructuring tools that automatically transform sequential programs to parallel programs. TOP/DOMDEC [2] is a tool for program partitioning. Some of the recently reported prototype scheduling tools are described below.

PAWS [6] is a performance evaluation tool that provides an interactive environment for evaluating the performance of various multiprocessor systems. PAWS does not perform scheduling and mapping and does not generate any code; it is useful only for simulating the execution of an application on various machines.

Hypertool [10] takes a user-partitioned sequential program as input and automatically allocates and schedules the partitions to processors. Proper synchronization primitives are also inserted automatically. Hypertool is a code generation tool, since the user program is compiled into a parallel program for the iPSC/2 hypercube computer using parallel code synthesis and optimization techniques. The tool also generates performance estimates, including the execution time, communication and suspension times for each processor, and the network delay for each communication channel. Scheduling is done using the MD algorithm or the MCP algorithm.

PYRROS [8] is a compile-time scheduling and code generation tool. Its input is a task graph and the associated sequential C code; the output is a static schedule and parallel C code for the iPSC/2. PYRROS consists of a task graph language with an interface to C, a scheduling system that uses only the DSC algorithm, an X-Windows based graphic displayer, and a code generator.
The task graph language allows the user to define partitioned programs and data. The scheduling system is used for clustering the task graph, performing load-balanced mapping, and ordering computation and communication. The graphic displayer shows task graphs and scheduling results in the form of Gantt charts. The code generator inserts synchronization primitives and performs parallel code optimization for the target parallel machine.

Parallax [4] incorporates seven classical scheduling heuristics designed in the seventies, providing an environment for parallel program developers to find out how the schedulers affect program performance on various parallel architectures. Users must provide the input program as a task graph and estimate task execution times; they must also express the target machine as an interconnection topology graph. Parallax then generates schedules in the form of Gantt charts, speedup curves, and processor and communication efficiency charts using an X-Windows interface. In addition, an animated display of the simulated running program helps developers evaluate the differences among the provided scheduling heuristics. Parallax, however, is not reported to generate executable parallel code.

OREGAMI [5] is designed for use in conjunction with parallel programming languages that support a communication model, such as OCCAM and C*, or with traditional programming languages like C and FORTRAN extended with communication facilities. As such, it is a set of tools that includes a LaRCS compiler to compile textual user task descriptions into specialized task graphs, which are called TCG
More informationHarp-DAAL for High Performance Big Data Computing
Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big
More informationScheduling Using Multi Objective Genetic Algorithm
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 3, Ver. II (May Jun. 2015), PP 73-78 www.iosrjournals.org Scheduling Using Multi Objective Genetic
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate
More informationTHE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS
Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT
More informationInformation Discovery, Extraction and Integration for the Hidden Web
Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk
More informationControlled duplication for scheduling real-time precedence tasks on heterogeneous multiprocessors
Controlled duplication for scheduling real-time precedence tasks on heterogeneous multiprocessors Jagpreet Singh* and Nitin Auluck Department of Computer Science & Engineering Indian Institute of Technology,
More informationAn Experimental Investigation into the Rank Function of the Heterogeneous Earliest Finish Time Scheduling Algorithm
An Experimental Investigation into the Rank Function of the Heterogeneous Earliest Finish Time Scheduling Algorithm Henan Zhao and Rizos Sakellariou Department of Computer Science, University of Manchester,
More informationAn Experimental Assessment of Express Parallel Programming Environment
An Experimental Assessment of Express Parallel Programming Environment Abstract shfaq Ahmad*, Min-You Wu**, Jaehyung Yang*** and Arif Ghafoor*** *Hong Kong University of Science and Technology, Hong Kong
More informationSCHEDULING AND LOAD SHARING IN MOBILE COMPUTING USING TICKETS
Baskiyar, S. and Meghanathan, N., Scheduling and load balancing in mobile computing using tickets, Proc. 39th SE-ACM Conference, Athens, GA, 2001. SCHEDULING AND LOAD SHARING IN MOBILE COMPUTING USING
More informationA Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors
A Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors Dr. Gurvinder Singh Department of Computer Science & Engineering, Guru Nanak Dev University, Amritsar- 143001,
More informationCharacterizing Home Pages 1
Characterizing Home Pages 1 Xubin He and Qing Yang Dept. of Electrical and Computer Engineering University of Rhode Island Kingston, RI 881, USA Abstract Home pages are very important for any successful
More informationJob Re-Packing for Enhancing the Performance of Gang Scheduling
Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT
More informationEngineering Drawings Recognition Using a Case-based Approach
Engineering Drawings Recognition Using a Case-based Approach Luo Yan Department of Computer Science City University of Hong Kong luoyan@cs.cityu.edu.hk Liu Wenyin Department of Computer Science City University
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationIntegrating MRPSOC with multigrain parallelism for improvement of performance
Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,
More informationLayer-Based Scheduling Algorithms for Multiprocessor-Tasks with Precedence Constraints
Layer-Based Scheduling Algorithms for Multiprocessor-Tasks with Precedence Constraints Jörg Dümmler, Raphael Kunis, and Gudula Rünger Chemnitz University of Technology, Department of Computer Science,
More informationData Partitioning. Figure 1-31: Communication Topologies. Regular Partitions
Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy
More information1.2 Numerical Solutions of Flow Problems
1.2 Numerical Solutions of Flow Problems DIFFERENTIAL EQUATIONS OF MOTION FOR A SIMPLIFIED FLOW PROBLEM Continuity equation for incompressible flow: 0 Momentum (Navier-Stokes) equations for a Newtonian
More informationParallel Query Processing and Edge Ranking of Graphs
Parallel Query Processing and Edge Ranking of Graphs Dariusz Dereniowski, Marek Kubale Department of Algorithms and System Modeling, Gdańsk University of Technology, Poland, {deren,kubale}@eti.pg.gda.pl
More informationThree basic multiprocessing issues
Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated
More informationLOW-DENSITY PARITY-CHECK (LDPC) codes [1] can
208 IEEE TRANSACTIONS ON MAGNETICS, VOL 42, NO 2, FEBRUARY 2006 Structured LDPC Codes for High-Density Recording: Large Girth and Low Error Floor J Lu and J M F Moura Department of Electrical and Computer
More informationFractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures
Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin
More informationAs computer networks and sequential computers advance,
Complex Distributed Systems Muhammad Kafil and Ishfaq Ahmad The Hong Kong University of Science and Technology A distributed system comprising networked heterogeneous processors requires efficient task-to-processor
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 23 Parallel Compilation Parallel Compilation Two approaches to compilation Parallelize a program manually Sequential code converted to parallel code Develop
More information6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP
LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃLPHÃIRUÃDÃSDFHLPH $GDSWLYHÃURFHVVLQJÃ$OJRULWKPÃRQÃDÃDUDOOHOÃ(PEHGGHG \VWHP Jack M. West and John K. Antonio Department of Computer Science, P.O. Box, Texas Tech University,
More informationArchitectural Considerations for Network Processor Design. EE 382C Embedded Software Systems. Prof. Evans
Architectural Considerations for Network Processor Design EE 382C Embedded Software Systems Prof. Evans Department of Electrical and Computer Engineering The University of Texas at Austin David N. Armstrong
More informationModeling and Scheduling for MPEG-4 Based Video Encoder Using a Cluster of Workstations
Modeling and Scheduling for MPEG-4 Based Video Encoder Using a Cluster of Workstations Yong He 1,IshfaqAhmad 2, and Ming L. Liou 1 1 Department of Electrical and Electronic Engineering {eehey, eeliou}@ee.ust.hk
More informationpc++/streams: a Library for I/O on Complex Distributed Data-Structures
pc++/streams: a Library for I/O on Complex Distributed Data-Structures Jacob Gotwals Suresh Srinivas Dennis Gannon Department of Computer Science, Lindley Hall 215, Indiana University, Bloomington, IN
More informationA Robust Wipe Detection Algorithm
A Robust Wipe Detection Algorithm C. W. Ngo, T. C. Pong & R. T. Chin Department of Computer Science The Hong Kong University of Science & Technology Clear Water Bay, Kowloon, Hong Kong Email: fcwngo, tcpong,
More informationA New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *
A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * Hsin-Ta Chiao and Shyan-Ming Yuan Department of Computer and Information Science National Chiao Tung University
More informationPerformance of Multihop Communications Using Logical Topologies on Optical Torus Networks
Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,
More informationMODIFIED VERTICAL HANDOFF DECISION ALGORITHM FOR IMPROVING QOS METRICS IN HETEROGENEOUS NETWORKS
MODIFIED VERTICAL HANDOFF DECISION ALGORITHM FOR IMPROVING QOS METRICS IN HETEROGENEOUS NETWORKS 1 V.VINOTH, 2 M.LAKSHMI 1 Research Scholar, Faculty of Computing, Department of IT, Sathyabama University,
More informationAN ABSTRACT OF THE THESIS OF. Title: Static Task Scheduling and Grain Packing in Parallel. Theodore G. Lewis
AN ABSTRACT OF THE THESIS OF Boontee Kruatrachue for the degree of Doctor of Philosophy in Electrical and Computer Engineering presented on June 10. 1987. Title: Static Task Scheduling and Grain Packing
More informationTechnische Universitat Munchen. Institut fur Informatik. D Munchen.
Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl
More informationDSC: Scheduling Parallel Tasks on an Unbounded Number of. Processors 3. Tao Yang and Apostolos Gerasoulis. Department of Computer Science
DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors 3 Tao Yang and Apostolos Gerasoulis Department of Computer Science Rutgers University New Brunswick, NJ 08903 Email: ftyang, gerasoulisg@cs.rutgers.edu
More information