CASCH: A Software Tool for Automatic Parallelization and Scheduling of Programs on Message-Passing Multiprocessors
Ishfaq Ahmad 1, Yu-Kwong Kwok 2, Min-You Wu 3, and Wei Shu 3

1 Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
2 Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong
3 Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL, USA

iahmad@cs.ust.hk, ykwok@eee.hku.hk, {wu, shu}@eee.engr.ucf.edu

Corresponding Author: Ishfaq Ahmad
Revised: May 1999

Abstract

Existing parallel machines provide tremendous potential for high performance, but programming them can be a cumbersome and error-prone process. This process is multi-phase in nature: it consists of designing an appropriate parallel algorithm for the application at hand, implementing the algorithm by partitioning control and data, scheduling and mapping the partitioned program onto the processors, orchestrating communication and synchronization, and identifying and interpreting various performance measures. A number of these phases, such as scheduling, mapping, and communication, can be very tedious if done manually and thus should be done automatically. On the other hand, some of the more complex phases, such as parallelization, are better done semi-automatically. Software tools providing automatic functionalities free programmers from the nuisance of manual labor and can ensure better performance through code restructuring and optimization. This paper describes an experimental software tool called CASCH (Computer Aided SCHeduling) for parallelizing and scheduling applications on message-passing multiprocessors. CASCH transforms a sequential program into a parallel program with automatic scheduling, mapping, communication, and synchronization.
The major strength of CASCH is its extensive library of scheduling and mapping algorithms, representing a broad range of state-of-the-art work reported in the recent literature. These algorithms can be interactively analyzed, tested, and compared using real data on a common platform with various performance objectives, enabling the programmer to select the most suitable algorithm for the application. With its graphical interface, CASCH can benefit both novice and expert programmers of parallel machines, and can also serve as a teaching and learning aid for understanding scheduling and mapping algorithms. Keywords: Automatic parallelization, scheduling, parallel programs, task graphs, message-passing multiprocessors, software tools.
1 Introduction

Automated parallel programming environments are highly desirable to the programmers of parallel computers. Software tools embedded in a parallel programming environment can carry out a number of tasks, such as interprocessor communication and proper scheduling, freeing an average programmer from the major hurdles of parallelization and potentially improving performance. Since these tasks can be quite tedious if done manually, the availability of automated tools is useful for experienced programmers as well. Even though a large body of literature exists in the area of scheduling and mapping [1], [6], [7] (see Sidebar 1), only a small portion of this knowledge has been exploited for practical purposes. Some software tools supporting automatic scheduling and mapping have been proposed, but their main function is to provide a simulation environment [5]. While they can help in understanding the operation and behavior of scheduling and mapping algorithms, they are inadequate for practical purposes. On the other end of the spectrum, a large number of parallelizing tools (see Sidebar 2) have been proposed, but they are usually not well integrated with sophisticated scheduling algorithms. This paper describes a software tool called CASCH (Computer Aided SCHeduling) for parallel processing on distributed-memory multiprocessors. CASCH aims to be a complete parallel programming environment, including parallelization, partitioning, scheduling, mapping, communication, synchronization, code generation, and performance evaluation. Program parallelization is performed by a compiler that automatically converts sequential applications into parallel code. The parallel code to be executed on a target machine is optimized through proper scheduling and mapping. CASCH is a unique tool in that it provides all of the important ingredients for developing parallel programs.
It is useful for a novice programmer since its parallelization and code generation are done automatically. It can also help an experienced researcher since it provides various facilities to fine-tune and optimize a program. CASCH includes an extensive library of state-of-the-art scheduling algorithms from the recent literature. The library is organized into different categories that are suitable for different architectural environments. The user can select one of these algorithms for scheduling the task graph generated from the application. The weights on the nodes and edges of the task graph are inserted using a database that contains the timings of various computation, communication, and I/O operations for different machines. These timings have been obtained through benchmarking. An attractive feature of CASCH is its graphical user interface, which provides a flexible and easy-to-use interactive environment for analyzing various scheduling and mapping algorithms, using task graphs generated randomly, interactively, or directly from real programs. The best schedule can be selected and used by the code generator to generate parallel code for a particular machine; the same process can be repeated for another machine. CASCH can also be used as a teaching aid for learning scheduling and mapping algorithms, since it allows the interactive creation of task graphs and machine topologies, and provides a trace of a schedule which permits the identification of the order in which tasks are scheduled by a particular algorithm.
The rest of this paper is organized as follows. Section 2 gives an overview of CASCH and describes its major functionalities. Section 3 includes the results of experiments conducted on an Intel Paragon and an IBM SP2 using CASCH. The last section includes a discussion of future work and some concluding remarks. The first sidebar is a survey of scheduling algorithms and includes a taxonomy to classify the algorithms. The second sidebar discusses related work and provides an overview of reported programming and scheduling tools.

2 Overview of CASCH

The overall organization of CASCH is shown in Figure 1. The main components of CASCH are: a compiler, which includes a lexical analyzer and a parser; a task graph generator; a weight estimator; a scheduling/mapping module; a communication inserter; an interactive user interface; a code generator; and a performance evaluation module.

Figure 1: The various components and functionalities of CASCH.
These components are described below.

2.1 User Programs

Using the CASCH tool, the user first writes a sequential program from which a task graph is generated. To facilitate the automation of program development, the sequential program is composed of a set of procedures called from the main program. A procedure, which should be written using the single-assignment rule, is an indivisible unit of computation to be scheduled on one processor. The grain sizes of procedures are determined by the programmer, and can be modified. Figure 2 shows an example (an implementation of a fast Fourier transform algorithm) in which the data matrix is partitioned by columns across processors. In the serial program, the constant N = PN x SN determines the problem size. Specifically, the constants PN and SN control the granularity of the partitioning: the larger the value of SN, the higher the granularity. In the current implementation of CASCH, these constants are defined by the user at compile-time. The procedures InterMult and IntraMult are called several times. The control dependencies can be ignored, so that a procedure call can be executed whenever all input data of the procedure are available. Data dependencies are defined by the single assignment of parameters in procedure calls. Communications can be invoked only at the beginning and the end of procedures. In other words, a procedure receives messages before it begins execution, and sends messages after it has finished execution. In general, the user is required to implement the application (e.g., FFT) only in the form of a sequential program consisting of a set of procedures. The sequential program is basically an ordinary C program, except that the user has to insert a few annotations in the form of #define compiler directives which instruct CASCH to invoke the partitioning of data arrays. For instance, in the FFT example, the user just needs to define the values of PN and SN in the header of the sequential C program.
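The effect of PN and SN on granularity can be sketched as follows. This is a hypothetical illustration in Python, not CASCH's actual implementation; the function name and the illustrative values PN = 4 and SN = 2 (so N = 8) are our own choices.

```python
# Hypothetical sketch (not CASCH code): partitioning an N-point data set
# into PN contiguous column blocks of SN points each, where N = PN x SN.
# Larger SN means fewer, coarser tasks, i.e., higher granularity.

def partition_columns(data, pn):
    """Split `data` into pn contiguous blocks; each block is one task's data."""
    n = len(data)
    assert n % pn == 0, "problem size must be divisible by the partition count"
    sn = n // pn  # points per block: N = PN x SN
    return [data[i * sn:(i + 1) * sn] for i in range(pn)]

points = list(range(8))                    # N = 8
blocks = partition_columns(points, pn=4)   # PN = 4, hence SN = 2
print(blocks)                              # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Each block becomes the data of one procedure call, so raising SN (and lowering PN) trades parallelism for lower scheduling and communication overhead.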
In this example, PN = 4 and SN = 2.

2.2 Lexical Analyzer and Parser

The lexical analyzer and parser analyze the data dependencies and user-defined partitions. In our implementation of CASCH, both components are constructed with the help of lex and yacc. If a syntax or semantic error is discovered in this stage, the user is advised to fix the problem before proceeding to the task graph generation phase. For a static program, the number of procedures is known before program execution. Many numerical applications belong to this static class of programs [8]. Such a program is system independent since communication primitives are not specified in the program. Data dependencies among the procedural parameters define a macro dataflow graph (i.e., the task graph) [10].

2.3 Task Graph Generation

A macro dataflow graph, which is generated directly from the main program, is a directed acyclic graph (DAG) with start and end points. A macro dataflow graph consists of a set of tasks {T_1, T_2, ..., T_n} and a set of edges {e_1, e_2, ..., e_m} such that e_k = (T_i, T_j) for 1 <= k <= m and some i, j where 1 <= i, j <= n.
Each node in the graph corresponds to a procedure or a task, and the node weight is the procedure's execution time. Each edge corresponds to a message transferred from one procedure to another, and the weight of the edge is equal to the transmission time of the message. When two tasks are scheduled to a single processor, the weight of the edge connecting them becomes zero.

2.4 Weight Estimator

The weights on the nodes and edges of the task graph are inserted with the help of an estimator that provides the execution-time costs of various instructions as well as the cost of communication on a given machine. These timings have been obtained through benchmarking, using an approach similar to [2], [4]. Communication estimation, which is obtained experimentally, is based on the cost of each communication primitive, such as send, receive, and broadcast. Our approach is similar to that used by Xu and Hwang [11]. Table 1 shows the communication times (assuming a stand-alone mode) for various target machines.

Table 1: Communication timing constants (microseconds) for various target machines.
Machine        | Start-up | Rate per byte | 1/ClockRate
Intel Paragon  |          |               |
IBM SP2        |          |               |

The current version of the computation estimator is a symbolic estimator. The estimation is based on reading through the code without running it. Its symbolic output is a function of the input parameters of the code. With a symbolic estimator and a restricted class of C codes, the code does not need re-estimation for different problem sizes. The code may include functions and procedures, and the estimator generates a performance estimate for each of them. The code may have for loops, whose bounds can be either constants or input parameters. The cost of each operation or built-in function is specified in the cost files. The total amount of computation can be obtained by summing all costs of operations and functions for a segment of code.
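The edge weights above can be sketched with the usual linear communication-cost model (start-up cost plus a per-byte rate, as in Table 1), with the cost dropping to zero for co-located tasks. This is an illustrative Python sketch, not CASCH's internals; the STARTUP and RATE values are invented, not the benchmarked constants of Table 1.

```python
# Hypothetical sketch (not CASCH's internals): edge weights of the macro
# dataflow graph under a linear cost model: startup + rate * bytes.
# An edge between two tasks scheduled on the same processor costs zero.

STARTUP = 50.0   # assumed start-up latency (microseconds), invented value
RATE = 0.5       # assumed cost per byte (microseconds), invented value

def edge_cost(msg_bytes, proc_src, proc_dst):
    """Transmission time of a message; zero if the tasks are co-located."""
    if proc_src == proc_dst:
        return 0.0
    return STARTUP + RATE * msg_bytes

# A 1000-byte message between different processors costs 50 + 0.5*1000 = 550 us;
# between tasks on the same processor it costs nothing.
print(edge_cost(1000, "P0", "P1"))   # 550.0
print(edge_cost(1000, "P0", "P0"))   # 0.0
```

Node weights are obtained analogously by summing per-operation costs from the cost files over the procedure body, as described for the symbolic estimator above.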
2.5 Task Scheduling and Mapping

A common approach to distributing the workload among p processors is to partition a problem into p tasks and perform a one-to-one mapping between the tasks and the processors. (We use the terms node and task synonymously.) Partitioning can be done with a block, cyclic, or block-cyclic pattern [10]. Such partitioning schemes, combined with simple scheduling heuristics such as the owner-computes rule, work for certain problems but can fail for many others, especially irregular problems, since it is difficult to balance the load and minimize dependencies simultaneously. An irregular problem is instead partitioned into many tasks, which are then scheduled so as to balance the load and minimize communication. In CASCH, the task graph generated from this partitioning is scheduled using a scheduling algorithm. Since one scheduling algorithm may not be suitable for every problem on a given architecture [7], CASCH includes various algorithms which are
Program FFT
/* N: number of points for the discrete Fourier transform; let N = PN x SN */
/* data[log(PN)+2][PN][SN] stores single-assigned data points for the discrete */
/* Fourier transform, organized as a PN x SN grid for parallel computation */

/* main program */
call Initiation;  /* serial part; initialize the array `data' */

/* parallel inter-multiplication of data points */
for i = log(PN) downto 1 do
  for j = 0 to PN-1 step 1<<i do
    for k = 0 to 1<<(i-1)-1 do
      call InterMult(data[i+1][j+k], data[i+1][j+k+1<<(i-1)], data[i][j+k], SN);
      call InterMult(data[i+1][j+k+1<<(i-1)], data[i+1][j+k], data[i][j+k+1<<(i-1)], SN);
      /* in each iteration, InterMult can be executed if arrays data[i+1][j+k] */
      /* and data[i+1][j+k+1<<(i-1)] are available; upon completion, data[i][j+k] */
      /* and data[i][j+k+1<<(i-1)] will be available */
    endfor
  endfor
endfor

/* parallel intra-multiplication of data points */
for i = 0 to PN-1 do
  call IntraMult(data[1][i], data[0][i], SN);
  /* in each iteration, IntraMult can be executed if array data[1][i] is */
  /* available; upon completion, data[0][i], which is the result, will be available */
endfor

call OutputResult;  /* serial part; inverse and return results */
EndProgram FFT

/* Procedure InterMult */
Procedure InterMult(inarray1, inarray2, outarray, n)
/* Input:  inarray1, inarray2  data points for multiplication */
/*         n                   number of data points in sub-array */
/* Output: outarray            array of output data */
for i = 0 to n-1 do
  outarray[i] = inarray1[i] @ inarray2[i];  /* '@' is the element-wise complex FFT operation */
endfor
EndProcedure InterMult

/* Procedure IntraMult */
Procedure IntraMult(inarray, outarray, n)
/* Input:  inarray  data points for multiplication */
/*         n        number of data points in sub-array */
/* Output: outarray array of output data */
for i = log(n) downto 1 do
  for j = 0 to n-1 step 1<<i do
    for k = 0 to 1<<(i-1)-1 do
      outarray[j+k] = inarray[j+k] @ inarray[j+k+1<<(i-1)];
      outarray[j+k+1<<(i-1)] = inarray[j+k+1<<(i-1)] @ inarray[j+k];
      /* '@' is the element-wise complex FFT operation */
    endfor
  endfor
  for j = 0 to n-1 do
    inarray[j] = outarray[j];
  endfor
endfor
EndProcedure IntraMult

Figure 2: A sequential program for fast Fourier transform.

suitable to various environments. The advantages of having a wide variety of algorithms in CASCH are as follows: the diversity of these heuristic algorithms allows the user to select a type of algorithm that is suitable to a particular architectural configuration.
The common platform provided by CASCH allows simultaneous comparisons among various algorithms, based on a number of performance objectives such as schedule length, number of processors used, and the algorithm's running time. The comparison among the algorithms can be done using manually generated task graphs as well as real data measured at execution time for a number of applications. For a given application program, the user can optimize the code by running various scheduling algorithms and then choosing the best schedule.

2.6 Communication Inserter

Synchronization among the tasks running on multiple processors is carried out by communication primitives. The basic communication primitives for exchanging messages between processors are send and receive. They must be used properly to ensure a correct sequence of computation. These primitives can be inserted automatically, reducing the programmer's burden and eliminating insertion errors. The procedure for inserting communication primitives is as follows. After scheduling and mapping, each task has been allocated to a processor. If an edge leaves a task for another task on a different processor, a send primitive is inserted after the task. Similarly, if an edge comes from a task on a different processor, a receive primitive is inserted before the task. The insertion method described above does not by itself ensure a correct communication sequence, because a deadlock may occur. Thus, we use a send-first strategy to reorder the communication primitives; that is, we reorder receives according to the order of sends. The communication primitive insertion algorithm is described below.

Communication Insertion Algorithm: Assume that after scheduling and mapping, each task T_i of the task graph is allocated to processor M(T_i), where M is a function mapping a task to a processor.
(1) For each edge e_k from task T_i to T_j for which M(T_i) != M(T_j), insert a send primitive after task T_i in processor M(T_i), denoted by S(e_k, T_i, M(T_j)); insert a receive primitive before task T_j in processor M(T_j), denoted by R(e_k, T_j, M(T_i)). Once a message has been scheduled to be sent to a processor, eliminate other sends and receives that transfer the same message to the same processor. Now, for each processor, we have a sequence X(e_m1, T_m1, P_m1), X(e_m2, T_m2, P_m2), ..., where X could be either S or R.

(2) For each pair of processors, say P_1 and P_2, extract all S(e_mi, T_mi, P_2) from processor P_1 to form a subsequence S_P1, and extract all R(e_mj, T_mj, P_1) from processor P_2 to form a subsequence R_P2.
Step 2.1: Within each segment of the subsequence S_P1 with the same task number, exchange the order of the sends according to the order of the receives as defined by the subsequence R_P2.
Step 2.2: If the two resultant subsequences still do not match each other, R_P2 is reordered according to the order of S_P1.

2.7 Code Generation

We use the example of Figure 2 to illustrate our code generation method. Figure 3 shows the generated
parallel code for three processors (assuming N = 8). Note that only the main program for each processor is shown. The data structure is the same as in Figure 2. In this example, the initial data is stored at processor P0. Data is transmitted to the other processors such that each processor obtains only the portion of data required for its computation; consequently, the memory space is compacted. To reduce the number of message transfers and, consequently, the time spent initiating messages, several messages can be packed and sent together. For example, the first four messages can be packed into one message and sent to processor P0. Such optimizations are also implemented in CASCH. Finally, the fourth data partition of the result is received from processor P0, the third from processor P1, and the first and the second from processor P2.

/* For P0 */
/* load array of data points from HOST */
receive(host, data[3][0]);
receive(host, data[3][1]);
receive(host, data[3][2]);
receive(host, data[3][3]);
InterMult(data[3][3], data[3][1], data[2][3], 2);
send(p1, data[2][3]);
InterMult(data[3][1], data[3][3], data[2][1], 2);
InterMult(data[3][2], data[3][0], data[2][2], 2);
send(p1, data[2][2]);
InterMult(data[3][0], data[3][2], data[2][0], 2);
InterMult(data[2][1], data[2][0], data[1][1], 2);
send(p2, data[1][1]);
InterMult(data[2][0], data[2][1], data[1][0], 2);
send(p2, data[1][0]);
InterMult(data[2][3], data[2][2], data[1][3], 2);
IntraMult(data[1][3], data[0][3], 2);
/* unload result array of data points to HOST */
send(host, data[0][3]);

/* For P1 */
receive(p0, data[2][2]);
receive(p0, data[2][3]);
InterMult(data[2][2], data[2][3], data[1][2], 2);
IntraMult(data[1][2], data[0][2], 2);
/* unload result array of data points to HOST */
send(host, data[0][2]);

/* For P2 */
receive(p0, data[1][1]);
IntraMult(data[1][1], data[0][1], 2);
receive(p0, data[1][0]);
IntraMult(data[1][0], data[0][0], 2);
/* unload result array of data points to HOST */
send(host, data[0][1]);
send(host, data[0][0]);

Figure 3: The parallel code for fast Fourier transform.

2.8 Graphical User Interface

The graphical capabilities of CASCH provide the user with an easy-to-use, window-based interactive interface. The graphical interface includes the following facilities, which map to the buttons shown in Figure 4.
Figure 4: The main menu of CASCH.

Source: The user can create, edit, or browse through sequential programs. The source button also includes a sub-menu for generating a task graph from the user program.

DAGs: This includes a facility to display a task graph (i.e., a DAG) generated from the user program (Figure 5 shows the DAG for the FFT program). Other options include the display of a randomly generated DAG or the interactive creation of a DAG. Zooming facilities (horizontal, vertical, or both) are included for proper viewing.

Figure 5: Display of the DAG for the FFT program.

TIGs: This facility displays task graphs that are TIGs (task interaction graphs, with undirected edges). This facility is similar in functionality to DAGs.

Processor Network: This facility allows the user to display a processor architecture (including the processors and the network topology). The editing facilities, similar to those for DAGs, allow the user to interactively create various network topologies. An example processor graph is illustrated in Figure 6.

Scheduling: This facility includes a sub-menu from which the user must first select one of the three classes of scheduling algorithms, i.e., BNP, UNC, and APN. Within each class, the user needs to
Figure 6: Display of processor graph.

select one of the scheduling algorithms. The scheduling algorithm requires the user to enter a number of parameters.

Show Schedule: The schedule generated as the result of invoking a scheduling algorithm can be displayed using this facility (Figure 7 and Figure 8 show the scheduling of the FFT example by five different algorithms). A schedule is displayed using a Gantt chart showing the start and finish times of the tasks on the various processors. Clicking on any task in the Gantt chart displays its start and finish times; the total schedule length is shown in the right corner of the window. A schedule also includes the communication messages on the network (displayed through another window, which is invoked by clicking on any two processors). An important feature of this facility is the trace option, which shows the step-by-step scheduling of each task. This is very useful for understanding the operation of a scheduling algorithm through observation of the order in which tasks are scheduled by the algorithm. Multiple such charts can be opened concurrently, allowing a comparison among the schedules generated by various algorithms. Indeed, in most cases, it may be necessary to try different algorithms. Two additional scheduling examples are depicted in Figure 9 and Figure 10.

Mapping: This set of facilities includes a number of mapping algorithms that are used to map TIGs onto the processors. At present, CASCH includes algorithms based on A*, recursive clustering, and simulated annealing [9]. Some scheduling algorithms (such as the UNC algorithms) may first generate clusters that need to be mapped onto the processors using one of these mapping algorithms.

Show Mapping: This shows an assignment of tasks to the processors generated by a mapping algorithm. The display includes a TIG in which a processor number is attached to each task (indicating the processor to which that task is allocated).
Code Generation: The code generator generates the parallel code for a given program according to a schedule/map generated by a scheduling/mapping algorithm.

Performance: The performance facilities include processor utilization, time spent in computation

(For definitions of these terms, see the sidebar "Recent Research in Multiprocessor Scheduling: A Brief Introduction." Tasks 1 and 14 are shown as thin rectangles due to their small weights. A mapping algorithm is required if scheduling and mapping are done separately.)
(a) The schedule generated by the DCP algorithm (schedule length = 71).
(b) The schedule generated by the DSC algorithm (schedule length = 68).
(c) The schedule generated by the MD algorithm (schedule length = 78).
Figure 7: Display of the Gantt charts showing the schedules generated by three UNC algorithms for the FFT program.
Figure 8: Display of the Gantt chart showing the schedules generated by the MCP and ETF algorithms for the FFT program (schedule length = 83).

Figure 9: A Gaussian elimination task graph.

and communication, and speedup. The computation and communication timing results are obtained by inserting the dclock() procedure call before and after each inter-task communication.

Data Partitioning: This facility includes tools for displaying structured and unstructured meshes as well as the partitioning of data across different processors.
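The total schedule length that the Gantt chart facility reports is simply the latest finish time over all processors. A minimal Python sketch (with invented task times, not CASCH output) illustrates the computation:

```python
# Hypothetical sketch: computing the schedule length (makespan) displayed by
# a Gantt chart, from per-processor lists of (task, start, finish) tuples.
# The task names and times below are invented for illustration.

def schedule_length(gantt):
    """Makespan = the latest finish time over all processors."""
    return max(finish for tasks in gantt.values() for (_, _, finish) in tasks)

gantt = {
    "P0": [("T1", 0, 10), ("T3", 15, 40)],
    "P1": [("T2", 0, 20), ("T4", 25, 68)],
}
print(schedule_length(gantt))   # 68
```

Comparing this single number across the concurrently opened charts is what lets the user rank the schedules produced by different algorithms.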
(a) The schedule generated by the DCP algorithm (schedule length = 439).
(b) The schedule generated by the MCP and ETF algorithms (schedule length = 431).
Figure 10: The schedules for the Gaussian elimination graph (Figure 9) produced by two scheduling algorithms.

3 Results

CASCH runs on a Sun workstation that is linked through a network to an Intel Paragon and an IBM SP2. We have parallelized several applications with CASCH using several of the scheduling algorithms described above. Here we discuss some preliminary results obtained by measuring the performance of three applications: an FFT, a Laplace equation solver, and the N-body problem. These results demonstrate the viability and usefulness of CASCH and allow a comparison among various scheduling algorithms. For reference, the results obtained with code generated by random scheduling of tasks are included. Such target code is generated by first partitioning the data among the processors in a fashion that reduces the dependencies among the partitions. Based on this partitioning, SPMD-style code is generated by randomly allocating the tasks to the processors. Hereafter, randsch denotes the results of these randomly scheduled programs.

The first set of results (see Table 2) is for the FFT example shown earlier with four different sizes of
input data: 512, 1024, 2048, and 4096 points. Table 2 shows the execution times for the various data sizes using the different scheduling algorithms. Each value is an average of ten runs on the Intel Paragon and IBM SP2. The Paragon consists of 140 i860/XP processors with a clock speed of 50 MHz, while the SP2 consists of IBM P2SC processors.

Table 2: Execution times of the FFT application for all the scheduling algorithms on the Intel Paragon and IBM SP2.

We observe that the execution times vary considerably across the algorithms. Among the UNC algorithms, the DCP algorithm yields the best performance due to its superior scheduling method; it also yields the best performance overall. Among the BNP algorithms, MCP is in general better, primarily because of its better task priority assignment method. Among the APN algorithms, BSA and MH perform better, due to their proper allocation of tasks and messages. All algorithms perform better than randsch: compared to random scheduling, the performance improvement is up to 400%. Our second application is based on a Gauss-Seidel algorithm for solving Laplace equations. The four matrix sizes used are 8, 16, 32, and 64. The application execution times using the various algorithms and data sizes are shown in Table 3. Again, using the DCP algorithm, more than 400% improvement over randsch is obtained. The UNC algorithms in general yield better schedules. The third application is the N-body problem. The execution-time results are shown in Table 4. Again, the scheduling algorithms exhibit a similar trend in application execution times on both parallel machines as in the previous two applications. The running times of the scheduling algorithms for the three applications are shown in Table 5.
Here, we can see that some scheduling algorithms take much longer than others due to their higher complexities (for details about the complexities of the algorithms, the reader is referred to [7]). Thus, there is a trade-off between the performance and the speed of a scheduling algorithm. For example, the DCP algorithm can generate better solutions than the DSC algorithm, but it is slower.
Table 3: Execution times of the Laplace equation solver application for all the scheduling algorithms on the Intel Paragon and IBM SP2.

Table 4: Execution times of the N-Body application for all the scheduling algorithms on the Intel Paragon and IBM SP2.

One important point to note from the above preliminary experimental results is that the performance of the scheduling algorithms can vary substantially. For instance, even though the average performance of the DCP algorithm is the best overall, it performs worse in some cases. Thus, the user may have to try different schedulers to obtain the best results for the application at hand.
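Trying different schedulers and keeping the best amounts to a simple minimization over the measured execution times. The following Python sketch is a hypothetical illustration of this selection step as a CASCH user would perform it interactively; the timing values are invented, not taken from Tables 2-4.

```python
# Hypothetical sketch: choosing the best scheduling algorithm for an
# application by comparing measured execution times.  The timings below
# are invented for illustration; they are not the paper's measurements.

def best_algorithm(timings):
    """Return the (algorithm, time) pair with the smallest execution time."""
    return min(timings.items(), key=lambda kv: kv[1])

# invented execution times (seconds) for one application on one machine
timings = {"DCP": 1.2, "DSC": 1.5, "MCP": 1.3, "randsch": 5.1}
algo, t = best_algorithm(timings)
print(algo, t)          # DCP 1.2

# relative improvement of the chosen schedule over random scheduling
improvement = (timings["randsch"] - t) / t * 100
print(round(improvement))   # 325 (percent)
```

Since no single heuristic wins in every case, this per-application selection is exactly where CASCH's common platform pays off.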
Table 5: Scheduling times (seconds) for the applications on a SPARCstation 2 for all the scheduling algorithms: (a) FFT application; (b) Laplace equation solver application; (c) N-Body application.

4 Conclusions and Future Work

CASCH achieves the objectives of automatic parallelization and scheduling of applications by providing a unified environment for various existing and conceptual machines. The combination of a parallel code generator and a scheduler makes it possible to test new scheduling ideas with real applications instead of just simulations. CASCH also provides a framework for users to compare various scheduling algorithms and to optimize their code by choosing the best algorithm. As shown in the experimental results, even an effective scheduling heuristic can sometimes produce inferior solutions. The extensive scheduling algorithms
library in CASCH, which includes various state-of-the-art scheduling algorithms, allows the user to optimize the execution of a parallel application by choosing the best schedule with the help of the interactive scheduling interface.

We are currently working on extending the capabilities of CASCH by including the following:

- Support for distributed computing systems, such as a collection of diverse machines working as a distributed heterogeneous supercomputer;
- Extension of the current database of benchmark timings to include more detailed and lower-level timings of the various computation, communication, and I/O operations of existing machines;
- Inclusion of debugging facilities for error detection, global variable checking, etc.;
- Design and implementation of partitioners for automatic or interactive partitioning of programs;
- Design of an intelligent tool that selects an appropriate scheduling algorithm for a given application; and
- Enhancement of the task graph module so that huge task graphs (e.g., for the Laplace equation solver with a larger matrix size) can be handled; the parameterized task graph (PTG) technique proposed by Cosnard and Loi [3] is being considered for this purpose.

Acknowledgments

The authors would like to thank the referees for their constructive and insightful comments, which have greatly improved the presentation of this paper. Preliminary versions of portions of this paper were presented at the 1997 International Conference on Parallel Processing and the 3rd European Conference on Parallel Processing. This research was supported by the Hong Kong Research Grants Council under contract numbers HKUST RI 93/94.EG06 and HKUST734/97E.

References

[1] E.G. Coffman, Computer and Job-Shop Scheduling Theory, Wiley, New York, 1976.
[2] W.W. Chu, M.-T. Lan, and J. Hellerstein, Estimation of Intermodule Communication (IMC) and Its Applications in Distributed Processing Systems, IEEE Trans. Computers, vol. C-33, no. 8, Aug. 1984.
[3] M.
Cosnard and M. Loi, Automatic Task Graphs Generation Techniques, Parallel Processing Letters, vol. 5, no. 4, Dec. 1995.
[4] A. Ghafoor and J. Yang, A Distributed Heterogeneous Supercomputing Management System, IEEE Computer, vol. 26, no. 6, June 1993.
[5] S.J. Kim and J.C. Browne, A General Approach to Mapping of Parallel Computation upon Multiprocessor Architectures, Proc. Int'l Conf. Parallel Processing, vol. II, pp. 1-8, Aug. 1988.
[6] Y.-K. Kwok and I. Ahmad, Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs onto Multiprocessors, IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 5, May 1996.
[7] Y.-K. Kwok and I. Ahmad, Benchmarking the Task Graph Scheduling Algorithms, Proc. Int'l Parallel Processing Symposium, Apr. 1998.
[8] R.E. Lord, J.S. Kowalik, and S.P. Kumar, Solving Linear Algebraic Equations on an MIMD Computer, J. ACM, vol. 30, no. 1, Jan. 1983.
[9] T.M. Nabhan and A.Y. Zomaya, A Parallel Simulated Annealing Algorithm with Low Communication Overhead, IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 12, pp.
1233, Dec. 1995.
[10] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, MIT Press, Cambridge, MA, 1989.
[11] Z. Xu and K. Hwang, Modeling Communication Overhead: MPI and MPL Performance on the IBM SP2, IEEE Parallel and Distributed Technology, vol. 4, no. 1, pp. 9-23, Spring 1996.

Sidebar 1: Recent Research in Multiprocessor Scheduling
A Brief Introduction

The most common models of a parallel program are the precedence-constrained directed acyclic graph (DAG) and the task interacting graph (TIG), which has no temporal dependencies. Figure 11(a) shows a parallel loop nest and Figure 11(b) depicts the DAG representing the loop.

(1) Data[0,j] := 0, for all j
(3) for i = 1 to 3 do in parallel
(4)   for j = 1 to 3-(i-1) do in parallel
(5)     Task[i,j] (INPUT: Data[i-1,j], Data[i-1,j+1]; OUTPUT: Data[i,j])

Figure 11: (a) A parallel program fragment and (b) a directed acyclic graph representing the program fragment. (The DAG's nodes are the tasks Task[1,1] through Task[3,1], and its edges carry the Data[i,j] values between dependent tasks.)

The weight associated with a node represents the execution time of the corresponding task, and the weight associated with an edge represents the communication time. Numerous techniques have been proposed in the literature for generating the node and edge weights off-line, such as execution profiling and analytical benchmarking [5]. With such a static model, a scheduler is invoked off-line at compile time. This form of the multiprocessor scheduling problem is called static scheduling or DAG scheduling.

Figure 12 provides a taxonomy of static parallel scheduling algorithms. The taxonomy is partial, since it does not include details of some of the earlier work on scheduling; only those scheduling algorithms that can be used in a realistic environment and are relevant in our context are considered. The taxonomy is hierarchical and develops by expanding each layer.
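The weighted-DAG model can be made concrete with a small sketch. The task and edge weights below are illustrative, not measured values; the bottom level computed here, i.e. the longest computation-plus-communication path from a task to an exit task, is a standard quantity used by several of the heuristics discussed in this sidebar, and its maximum is the critical-path length:

```python
# Minimal weighted-DAG model of a parallel program: node weights are
# task execution times, edge weights are communication times.
tasks = {"T1": 4, "T2": 3, "T3": 3, "T4": 5}             # node -> computation cost
edges = {("T1", "T2"): 2, ("T1", "T3"): 1,
         ("T2", "T4"): 2, ("T3", "T4"): 3}               # edge -> communication cost

def blevel(task, memo=None):
    """Bottom level: longest path (computation + communication) from
    `task` to an exit node. The largest b-level over all tasks is the
    critical-path length, a lower bound on any schedule."""
    if memo is None:
        memo = {}
    if task in memo:
        return memo[task]
    succs = [(v, w) for (u, v), w in edges.items() if u == task]
    memo[task] = tasks[task] + max((w + blevel(v, memo) for v, w in succs),
                                   default=0)
    return memo[task]

cp_length = max(blevel(t) for t in tasks)
print(cp_length)  # critical-path length of this toy DAG
```

Scheduling heuristics differ mainly in how they exploit such path information, e.g. prioritizing tasks by b-level or repeatedly zeroing edges on the current critical path.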
Thick arrows indicate the relevance to our discussion and a further division of a particular layer; thin arrows do not lead to a further division in the taxonomy. The highest level of the taxonomy is divided into two categories, depending upon whether or not the tasks are independent. This discussion is limited to dependent tasks. Earlier algorithms make simplifying assumptions about the task graph representing the program and the model of the multiprocessor system. Some algorithms ignore the precedence constraints and consider the task graph to be free of temporal dependencies (a task interacting graph). The algorithms considering the more realistic task precedence
constrained graph assume the graph to be of a special structure, such as a tree or fork-join graph. In general, however, parallel programs come in a variety of structures. The algorithms designed to tackle arbitrary graph structures can be divided further into two categories: some assume the computational costs of all the tasks to be the same, while others allow arbitrary computational costs. It is worth mentioning that the scheduling problem is NP-complete even in two simple cases: (1) scheduling unit-time tasks to an arbitrary number of processors, and (2) scheduling tasks of one or two time units to two processors. Scheduling with communication may be done with or without duplication of tasks [3]. Each class can be further divided into two categories. Note that only the division of the No-Duplication class is shown; an analogous division of the Duplication class can be envisaged but is omitted due to its similarity to the No-Duplication class. Some scheduling algorithms assume the availability of an unlimited number of processors [2], [6], [7], [9] with a fully connected network; these are called the UNC (unbounded number of clusters) scheduling algorithms. The algorithms assuming a limited number of processors are called the BNP (bounded number of processors) scheduling algorithms. In both the UNC and BNP scheduling algorithms, the processors are assumed to be fully connected, and link contention and the routing strategies used for communication are ignored. If scheduling and mapping are done in separate steps, the schedules generated by the UNC or BNP algorithms can be mapped onto the processors using the indirect mapping approach. The algorithms that assume the system to have an arbitrary network topology are called the APN (arbitrary processor network) scheduling algorithms [8]. The basis of a major component of scheduling algorithms (in all three classes) is the classical list scheduling approach [1], [4].
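A minimal sketch of this classical list scheduling loop on a fully connected machine, as assumed by the BNP class, follows. The graph, the static priorities, and the two-processor setting are illustrative assumptions, not any particular algorithm from the CASCH library:

```python
# List scheduling sketch: pick the highest-priority ready task and place
# it on the processor giving the earliest start time, accounting for
# communication when a predecessor sits on a different processor.
tasks = {"A": 2, "B": 3, "C": 3, "D": 2}                 # task -> execution time
edges = {("A", "B"): 1, ("A", "C"): 1,
         ("B", "D"): 1, ("C", "D"): 2}                   # edge -> communication time
priority = {"A": 4, "B": 3, "C": 2, "D": 1}              # e.g., static b-levels

def list_schedule(num_procs):
    preds = {t: [u for (u, v) in edges if v == t] for t in tasks}
    proc_free = [0] * num_procs                          # earliest free time per processor
    finish, where = {}, {}
    ready = [t for t in tasks if not preds[t]]
    while ready:
        ready.sort(key=lambda t: -priority[t])           # highest priority first
        task = ready.pop(0)
        best = (None, None)
        for p in range(num_procs):
            # data from a predecessor on another processor pays the edge cost
            data_ready = max((finish[u] + (edges[(u, task)] if where[u] != p else 0)
                              for u in preds[task]), default=0)
            start = max(proc_free[p], data_ready)
            if best[0] is None or start < best[1]:
                best = (p, start)
        p, start = best
        finish[task], where[task] = start + tasks[task], p
        proc_free[p] = finish[task]
        ready += [v for v in tasks if v not in finish and v not in ready
                  and all(u in finish for u in preds[v])]
    return max(finish.values())                          # schedule length

print(list_schedule(2))
```

On this toy graph two processors yield a schedule length of 8 versus 10 on one; the heuristics described below differ mainly in how the priorities are defined and whether they are recomputed dynamically.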
In list scheduling, the tasks are assigned priorities and placed in a ready list arranged in descending order of priority. A task with a higher priority is examined for scheduling before a task with a lower priority; if more than one task has the same priority, ties are broken using some method. A task selected for scheduling is allocated to the processor that allows the earliest start time. After a task is scheduled, more tasks may be added to the ready list, and the tasks in the ready list are again examined and scheduled. This continues until all tasks are scheduled. The scheduling algorithm library of CASCH includes five UNC, six BNP, and four APN algorithms. The major characteristics of these algorithms are briefly described below; the reader is referred to [2] for a more detailed description and comparison.

References for Sidebar 1:
[1] T.L. Adam, K.M. Chandy, and J. Dickson, A Comparison of List Scheduling for Parallel Processing Systems, Comm. ACM, vol. 17, no. 12, Dec. 1974.
[2] I. Ahmad, Y.-K. Kwok, and M.-Y. Wu, Analysis, Evaluation, and Comparison of Algorithms for Scheduling Task Graphs on Parallel Processors, Proc. 2nd Int'l Symposium on Parallel Architecture, Algorithms, and Networks, June 1996.
[3] I. Ahmad and Y.-K. Kwok, On Exploiting Task Duplication in Parallel Program Scheduling, IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 9, Sept. 1998.
Figure 12: A partial taxonomy of the multiprocessor scheduling problem. (Static parallel scheduling divides into independent tasks and multiple interacting tasks; the latter into the task interaction graph and task precedence graph models; precedence graphs into restricted and arbitrary graph structures; and then by unit versus arbitrary computational costs, no communication versus communication, duplication versus no duplication, and unlimited versus limited numbers of processors, with fully connected processors giving the UNC and BNP classes and arbitrarily connected processors the APN class.)

[4] H. El-Rewini and T.G. Lewis, Scheduling Parallel Programs onto Arbitrary Target Machines, J. Parallel and Distributed Computing, vol. 9, no. 2, June 1990.
[5] K. Hwang, Z. Xu, and M. Arakawa, Benchmark Evaluation of the IBM SP2 for Parallel Signal Processing, IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 5, May 1996.
[6] S.J. Kim and J.C. Browne, A General Approach to Mapping of Parallel Computation upon Multiprocessor Architectures, Proc. Int'l Conf. Parallel Processing, vol. II, pp. 1-8, Aug. 1988.
[7] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, MIT Press, Cambridge, MA, 1989.
[8] G.C. Sih and E.A. Lee, A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures, IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 2, Feb. 1993.
[9] T. Yang and A. Gerasoulis, DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors, IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 9, Sept. 1994.

Sidebar 2: Other Parallel Programming Tools

Several research efforts have demonstrated the usefulness of program development tools for parallel processing on message-passing multiprocessors. Essentially, these tools fall into two classes. The first
class, which mainly comprises commercial tools, provides software development and debugging environments; the ATEXPERT by Cray Research [1] is an example. Some of these tools also provide performance tuning and other program development facilities. The second class performs program transformation through restructuring. PARASCOPE [3] and TINY [9] are restructuring tools that automatically transform sequential programs to parallel programs. TOP/DOMDEC [2] is a tool for program partitioning. Some of the recently reported prototype scheduling tools are described below.

PAWS [6] is a performance evaluation tool that provides an interactive environment for evaluating the performance of various multiprocessor systems. PAWS does not perform scheduling and mapping and does not generate any code; it is useful only for simulating the execution of an application on various machines.

Hypertool [10] takes a user-partitioned sequential program as input and automatically allocates and schedules the partitions to processors. Proper synchronization primitives are also inserted automatically. Hypertool is a code generation tool, since the user program is compiled into a parallel program for the iPSC/2 hypercube computer using parallel code synthesis and optimization techniques. The tool also generates performance estimates, including the execution time, communication and suspension times for each processor, and the network delay for each communication channel. Scheduling is done using the MD algorithm or the MCP algorithm.

PYRROS [8] is a compile-time scheduling and code generation tool. Its input is a task graph and the associated sequential C code; the output is a static schedule and parallel C code for the iPSC/2. PYRROS consists of a task graph language with an interface to C, a scheduling system that uses only the DSC algorithm, an X-Windows based graphic displayer, and a code generator.
The task graph language allows the user to define partitioned programs and data. The scheduling system is used for clustering the task graph, performing load-balanced mapping, and ordering computation and communication. The graphic displayer shows task graphs and scheduling results in the form of Gantt charts. The code generator inserts synchronization primitives and performs parallel code optimization for the target parallel machine.

Parallax [4] incorporates seven classical scheduling heuristics designed in the seventies, providing an environment for parallel program developers to find out how the schedulers affect program performance on various parallel architectures. Users must provide the input program as a task graph and estimate task execution times; they must also express the target machine as an interconnection topology graph. Parallax then generates schedules in the form of Gantt charts, speedup curves, and processor and communication efficiency charts using an X-Windows interface. In addition, an animated display of the simulated running program helps developers evaluate the differences among the provided scheduling heuristics. Parallax, however, is not reported to generate executable parallel code.

OREGAMI [5] is designed for use in conjunction with parallel programming languages that support a communication model, such as OCCAM and C*, or with traditional programming languages like C and FORTRAN extended with communication facilities. As such, it is a set of tools that includes a LaRCS compiler to compile textual user task descriptions into specialized task graphs, which are called TCG
More informationHarp-DAAL for High Performance Big Data Computing
Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big
More informationScheduling Using Multi Objective Genetic Algorithm
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 3, Ver. II (May Jun. 2015), PP 73-78 www.iosrjournals.org Scheduling Using Multi Objective Genetic
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate
More informationTHE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS
Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT
More informationInformation Discovery, Extraction and Integration for the Hidden Web
Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk
More informationControlled duplication for scheduling real-time precedence tasks on heterogeneous multiprocessors
Controlled duplication for scheduling real-time precedence tasks on heterogeneous multiprocessors Jagpreet Singh* and Nitin Auluck Department of Computer Science & Engineering Indian Institute of Technology,
More informationAn Experimental Investigation into the Rank Function of the Heterogeneous Earliest Finish Time Scheduling Algorithm
An Experimental Investigation into the Rank Function of the Heterogeneous Earliest Finish Time Scheduling Algorithm Henan Zhao and Rizos Sakellariou Department of Computer Science, University of Manchester,
More informationAn Experimental Assessment of Express Parallel Programming Environment
An Experimental Assessment of Express Parallel Programming Environment Abstract shfaq Ahmad*, Min-You Wu**, Jaehyung Yang*** and Arif Ghafoor*** *Hong Kong University of Science and Technology, Hong Kong
More informationSCHEDULING AND LOAD SHARING IN MOBILE COMPUTING USING TICKETS
Baskiyar, S. and Meghanathan, N., Scheduling and load balancing in mobile computing using tickets, Proc. 39th SE-ACM Conference, Athens, GA, 2001. SCHEDULING AND LOAD SHARING IN MOBILE COMPUTING USING
More informationA Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors
A Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors Dr. Gurvinder Singh Department of Computer Science & Engineering, Guru Nanak Dev University, Amritsar- 143001,
More informationCharacterizing Home Pages 1
Characterizing Home Pages 1 Xubin He and Qing Yang Dept. of Electrical and Computer Engineering University of Rhode Island Kingston, RI 881, USA Abstract Home pages are very important for any successful
More informationJob Re-Packing for Enhancing the Performance of Gang Scheduling
Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT
More informationEngineering Drawings Recognition Using a Case-based Approach
Engineering Drawings Recognition Using a Case-based Approach Luo Yan Department of Computer Science City University of Hong Kong luoyan@cs.cityu.edu.hk Liu Wenyin Department of Computer Science City University
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationIntegrating MRPSOC with multigrain parallelism for improvement of performance
Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,
More informationLayer-Based Scheduling Algorithms for Multiprocessor-Tasks with Precedence Constraints
Layer-Based Scheduling Algorithms for Multiprocessor-Tasks with Precedence Constraints Jörg Dümmler, Raphael Kunis, and Gudula Rünger Chemnitz University of Technology, Department of Computer Science,
More informationData Partitioning. Figure 1-31: Communication Topologies. Regular Partitions
Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy
More information1.2 Numerical Solutions of Flow Problems
1.2 Numerical Solutions of Flow Problems DIFFERENTIAL EQUATIONS OF MOTION FOR A SIMPLIFIED FLOW PROBLEM Continuity equation for incompressible flow: 0 Momentum (Navier-Stokes) equations for a Newtonian
More informationParallel Query Processing and Edge Ranking of Graphs
Parallel Query Processing and Edge Ranking of Graphs Dariusz Dereniowski, Marek Kubale Department of Algorithms and System Modeling, Gdańsk University of Technology, Poland, {deren,kubale}@eti.pg.gda.pl
More informationThree basic multiprocessing issues
Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated
More informationLOW-DENSITY PARITY-CHECK (LDPC) codes [1] can
208 IEEE TRANSACTIONS ON MAGNETICS, VOL 42, NO 2, FEBRUARY 2006 Structured LDPC Codes for High-Density Recording: Large Girth and Low Error Floor J Lu and J M F Moura Department of Electrical and Computer
More informationFractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures
Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin
More informationAs computer networks and sequential computers advance,
Complex Distributed Systems Muhammad Kafil and Ishfaq Ahmad The Hong Kong University of Science and Technology A distributed system comprising networked heterogeneous processors requires efficient task-to-processor
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 23 Parallel Compilation Parallel Compilation Two approaches to compilation Parallelize a program manually Sequential code converted to parallel code Develop
More information6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP
LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃLPHÃIRUÃDÃSDFHLPH $GDSWLYHÃURFHVVLQJÃ$OJRULWKPÃRQÃDÃDUDOOHOÃ(PEHGGHG \VWHP Jack M. West and John K. Antonio Department of Computer Science, P.O. Box, Texas Tech University,
More informationArchitectural Considerations for Network Processor Design. EE 382C Embedded Software Systems. Prof. Evans
Architectural Considerations for Network Processor Design EE 382C Embedded Software Systems Prof. Evans Department of Electrical and Computer Engineering The University of Texas at Austin David N. Armstrong
More informationModeling and Scheduling for MPEG-4 Based Video Encoder Using a Cluster of Workstations
Modeling and Scheduling for MPEG-4 Based Video Encoder Using a Cluster of Workstations Yong He 1,IshfaqAhmad 2, and Ming L. Liou 1 1 Department of Electrical and Electronic Engineering {eehey, eeliou}@ee.ust.hk
More informationpc++/streams: a Library for I/O on Complex Distributed Data-Structures
pc++/streams: a Library for I/O on Complex Distributed Data-Structures Jacob Gotwals Suresh Srinivas Dennis Gannon Department of Computer Science, Lindley Hall 215, Indiana University, Bloomington, IN
More informationA Robust Wipe Detection Algorithm
A Robust Wipe Detection Algorithm C. W. Ngo, T. C. Pong & R. T. Chin Department of Computer Science The Hong Kong University of Science & Technology Clear Water Bay, Kowloon, Hong Kong Email: fcwngo, tcpong,
More informationA New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *
A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * Hsin-Ta Chiao and Shyan-Ming Yuan Department of Computer and Information Science National Chiao Tung University
More informationPerformance of Multihop Communications Using Logical Topologies on Optical Torus Networks
Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,
More informationMODIFIED VERTICAL HANDOFF DECISION ALGORITHM FOR IMPROVING QOS METRICS IN HETEROGENEOUS NETWORKS
MODIFIED VERTICAL HANDOFF DECISION ALGORITHM FOR IMPROVING QOS METRICS IN HETEROGENEOUS NETWORKS 1 V.VINOTH, 2 M.LAKSHMI 1 Research Scholar, Faculty of Computing, Department of IT, Sathyabama University,
More informationAN ABSTRACT OF THE THESIS OF. Title: Static Task Scheduling and Grain Packing in Parallel. Theodore G. Lewis
AN ABSTRACT OF THE THESIS OF Boontee Kruatrachue for the degree of Doctor of Philosophy in Electrical and Computer Engineering presented on June 10. 1987. Title: Static Task Scheduling and Grain Packing
More informationTechnische Universitat Munchen. Institut fur Informatik. D Munchen.
Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl
More informationDSC: Scheduling Parallel Tasks on an Unbounded Number of. Processors 3. Tao Yang and Apostolos Gerasoulis. Department of Computer Science
DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors 3 Tao Yang and Apostolos Gerasoulis Department of Computer Science Rutgers University New Brunswick, NJ 08903 Email: ftyang, gerasoulisg@cs.rutgers.edu
More information