CASCH: A Software Tool for Automatic Parallelization and Scheduling of Programs on Message-Passing Multiprocessors


Ishfaq Ahmad (1), Yu-Kwong Kwok (2), Min-You Wu (3), and Wei Shu (3)

(1) Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
(2) Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong
(3) Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL, USA

iahmad@cs.ust.hk, ykwok@eee.hku.hk, {wu, shu}@eee.engr.ucf.edu

Corresponding Author: Ishfaq Ahmad
Revised: May 1999

Abstract

Existing parallel machines provide tremendous potential for high performance, but programming them can be a cumbersome and error-prone process. This process is multi-phase in nature: it consists of designing an appropriate parallel algorithm for the application at hand, implementing the algorithm by partitioning control and data, scheduling and mapping the partitioned program onto the processors, orchestrating communication and synchronization, and identifying and interpreting various performance measures. A number of these phases, such as scheduling, mapping, and communication, can be very tedious if done manually, and thus should be done automatically. On the other hand, some of the more complex phases, such as parallelization, are better done semi-automatically. Software tools providing automatic functionalities free programmers from the nuisance of manual labor and can ensure better performance through code restructuring and optimization. This paper describes an experimental software tool called CASCH (Computer Aided SCHeduling) for parallelizing and scheduling applications on message-passing multiprocessors. CASCH transforms a sequential program into a parallel program with automatic scheduling, mapping, communication, and synchronization. The major strength of CASCH is its extensive library of scheduling and mapping algorithms representing a broad range of state-of-the-art work reported in the recent literature. These algorithms can be interactively analyzed, tested, and compared using real data on a common platform with various performance objectives, enabling the programmer to select the most suitable algorithm for the application. With its graphical interface, CASCH can be beneficial for both novice and expert programmers of parallel machines, and can also serve as a teaching and learning aid for understanding scheduling and mapping algorithms.

Keywords: Automatic parallelization, scheduling, parallel programs, task graphs, message-passing multiprocessors, software tools.

1 Introduction

Automated parallel programming environments are highly desirable to the programmers of parallel computers. Software tools embedded in a parallel programming environment can carry out a number of tasks, such as interprocessor communication and proper scheduling, freeing an average programmer from the major hurdles of parallelization and potentially improving performance. Since these tasks can be quite tedious if done manually, the availability of automated tools is useful for an experienced programmer as well. Even though a large body of literature exists in the area of scheduling and mapping [1], [6], [7] (see Sidebar 1), only a small portion of this knowledge has been exploited for practical purposes. Some software tools supporting automatic scheduling and mapping have been proposed, but their main function is to provide a simulation environment [5]. While they can help in understanding the operation and behavior of scheduling and mapping algorithms, they are inadequate for practical purposes. On the other end of the spectrum, a large number of parallelizing tools (see Sidebar 2) have been proposed, but they are usually not well integrated with sophisticated scheduling algorithms.

This paper describes a software tool called CASCH (Computer Aided SCHeduling) for parallel processing on distributed-memory multiprocessors. CASCH aims to be a complete parallel programming environment, including parallelization, partitioning, scheduling, mapping, communication, synchronization, code generation, and performance evaluation. Program parallelization is performed by a compiler that automatically converts sequential applications into parallel codes. The parallel code to be executed on a target machine is optimized through proper scheduling and mapping. CASCH is a unique tool in that it provides all of the important ingredients for developing parallel programs. It is useful for a novice programmer since its parallelization and code generation are done automatically. It can also help an experienced researcher since it provides various facilities to fine-tune and optimize a program.

CASCH includes an extensive library of state-of-the-art scheduling algorithms from the recent literature. The library is organized into different categories that are suitable for different architectural environments. The user can select one of these algorithms for scheduling the task graph generated from the application. The weights on the nodes and edges of the task graph are inserted using a database that contains the timings of various computation, communication, and I/O operations for different machines. These timings have been obtained through benchmarking. An attractive feature of CASCH is its graphical user interface, which provides a flexible and easy-to-use interactive environment for analyzing various scheduling and mapping algorithms, using task graphs generated randomly, interactively, or directly from real programs. The best schedule can be selected and used by the code generator to generate parallel code for a particular machine; the same process can be repeated for another machine. CASCH can also be used as a teaching aid for learning scheduling and mapping algorithms, since it allows the interactive creation of task graphs and machine topologies, and provides a trace of a schedule which permits the identification of the order in which tasks are scheduled by a particular algorithm.

The rest of this paper is organized as follows. Section 2 gives an overview of CASCH and describes its major functionalities. Section 3 presents the results of experiments conducted on an Intel Paragon and an IBM SP2 using CASCH. The last section includes a discussion of future work and some concluding remarks. The first sidebar is a survey of scheduling algorithms and includes a taxonomy to classify them. The second sidebar discusses related work and provides an overview of the reported programming and scheduling tools.

2 Overview of CASCH

The overall organization of CASCH is shown in Figure 1. The main components of CASCH are: a compiler, which includes a lexical analyzer and a parser; a task graph generator; a weight estimator; a scheduling/mapping module; a communication inserter; an interactive user interface; a code generator; and a performance evaluation module.

[Figure 1: The various components and functionalities of CASCH. The figure shows the sequential user program entering the lexer and parser (with its symbolic table), followed by the task graph generator and the weight estimator, which draws on computation, communication, and input/output timings for targets such as clusters of workstations and the Intel Paragon. The scheduling/mapping module contains the UNC algorithms (EZ by Sarkar, LC by Kim & Browne, DSC by Yang & Gerasoulis, MD by Wu & Gajski, DCP by Kwok & Ahmad), the BNP algorithms (HLFET by Hu, ISH by Kruatrachue & Lewis, MCP by Wu & Gajski, ETF by Hwang et al., DLS by Sih & Lee, LAST by Baxter & Patel), the BNP/APN algorithms (MH by El-Rewini & Lewis, DLS by Sih & Lee, BU by Mehdiratta & Ghose, BSA by Kwok & Ahmad), and the mapping algorithms. These feed the communication inserter and the code generator, while the interactive user interface (graphical editing tools; displays of architectures, task graphs, Gantt charts, and communication traffic) and the performance evaluation module (application statistics, machine statistics, parallel program testing, performance reports) complete the tool.]

These components are described below.

2.1 User Programs

Using the CASCH tool, the user first writes a sequential program from which a task graph is generated. To facilitate the automation of program development, the sequential program is composed of a set of procedures called from the main program. A procedure, which should be written using the single-assignment rule, is an indivisible unit of computation to be scheduled on one processor. The grain sizes of procedures are determined by the programmer, and can be modified. Figure 2 shows an example (an implementation of a fast Fourier transform algorithm) in which the data matrix is partitioned by columns across processors. In the serial program, the constant N = PN x SN determines the problem size. Specifically, the constants PN and SN control the granularity of the partitioning: the larger the value of SN, the higher the granularity. In the current implementation of CASCH, these constants are defined by the user at compile-time. The procedures InterMult and IntraMult are called several times. The control dependencies can be ignored, so that a procedure call can be executed whenever all input data of the procedure are available. Data dependencies are defined by the single assignment of parameters in procedure calls. Communications can be invoked only at the beginning and the end of procedures. In other words, a procedure receives messages before it begins execution, and sends messages after it has finished execution.

In general, the user is required to implement the application (e.g., FFT) only in the form of a sequential program consisting of a set of procedures. The sequential program is basically an ordinary C program except that the user has to insert a few annotations in the form of #define compiler directives which instruct CASCH to invoke the partitioning of data arrays. For instance, in the FFT example, the user just needs to define the values of PN and SN in the header of the sequential C program. In this example, PN = 4 and SN = 2.

2.2 Lexical Analyzer and Parser

The lexical analyzer and parser analyze the data dependencies and user-defined partitions. In our implementation of CASCH, both components are constructed with the help of lex and yacc. If a syntax or semantic error is discovered at this stage, the user is advised to fix the problem before proceeding to the task graph generation phase. For a static program, the number of procedures is known before program execution. Many numerical applications belong to this static class of programs [8]. Such a program is system independent since communication primitives are not specified in the program. Data dependencies among the procedural parameters define a macro dataflow graph (i.e., the task graph) [10].

2.3 Task Graph Generation

A macro dataflow graph, which is generated directly from the main program, is a directed acyclic graph (DAG) with start and end points. A macro dataflow graph consists of a set of tasks {T_1, T_2, ..., T_n} and a set of edges {e_1, e_2, ..., e_m} such that e_k = (T_i, T_j) for 1 <= k <= m and some i, j where 1 <= i, j <= n. (One plausible in-memory form of such a graph is sketched below.)
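To make this representation concrete, the following sketch shows one plausible in-memory form of a macro dataflow graph in C. All type and field names here are our own illustration rather than CASCH's actual data structures; the node and edge weights anticipate the conventions described next, including the rule that an edge's cost drops to zero when both of its tasks are placed on the same processor.

/* A task (node) of the macro dataflow graph: one procedure call,
   weighted by its estimated execution time. */
typedef struct {
    int    id;          /* task number, 1..n */
    double exec_time;   /* node weight: estimated execution time */
    int    processor;   /* filled in by the scheduler; -1 if unscheduled */
} Task;

/* An edge e_k from task src to task dst: one message, weighted by its
   estimated transmission time. */
typedef struct {
    int    src, dst;    /* indices into the task array */
    double comm_time;   /* edge weight: estimated transmission time */
} Edge;

typedef struct {
    int   n_tasks, n_edges;
    Task *tasks;
    Edge *edges;
} MacroDataflowGraph;

/* Effective cost of an edge under a given schedule: zero when both
   endpoint tasks are mapped to the same processor. */
double edge_cost(const MacroDataflowGraph *g, const Edge *e)
{
    return (g->tasks[e->src].processor == g->tasks[e->dst].processor)
               ? 0.0 : e->comm_time;
}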

Each node in the graph corresponds to a procedure or a task, and the node weight is given by the procedure execution time. (We use "node" and "task" synonymously.) Each edge corresponds to a message transferred from one procedure to another, and the weight of the edge is equal to the transmission time of the message. When two tasks are scheduled to a single processor, the weight of the edge connecting them becomes zero.

2.4 Weight Estimator

The weights on the nodes and edges of the task graph are inserted with the help of an estimator that provides the execution-time costs of various instructions as well as the cost of communication on a given machine. These timings have been obtained through benchmarking using an approach similar to [2], [4]. Communication estimation, which is obtained experimentally, is based on the cost of each communication primitive, such as send, receive, and broadcast. Our approach is similar to that used by Xu and Hwang [11].

[Table 1: Communication timing constants (microseconds) for various target machines, assuming a stand-alone mode. For each machine (Intel Paragon, IBM SP2), the table lists the start-up time, the rate per byte, and 1/ClockRate.]

The current version of the computation estimator is a symbolic estimator. The estimation is based on reading through the code without running it. Its symbolic output is in the form of a function of the input parameters of the code. With a symbolic estimator and a restricted class of C codes, the code does not need re-estimation for different problem sizes. The code may include functions and procedures, and the estimator generates a performance estimate for each of them. The code may have for loops, whose bounds can be either constants or input parameters. The cost of each operation or built-in function is specified in the cost files. The total amount of computation is obtained by summing the costs of all operations and functions in a segment of code.
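The communication timings of Table 1 follow the familiar linear model, in which sending a message of b bytes costs a fixed start-up time plus b times a per-byte rate. A minimal sketch of how an estimator might apply such constants is given below; the function and field names are our own, and the model itself is only our reading of the table's columns.

#include <stddef.h>

/* Linear point-to-point communication model:
   time = start-up + rate_per_byte * message size. */
typedef struct {
    double startup_us;        /* fixed per-message start-up cost (microseconds) */
    double rate_per_byte_us;  /* incremental cost per byte (microseconds) */
} CommTimings;

/* Estimated transmission time, in microseconds, of a message of
   `bytes` bytes on the machine described by `t`. */
double estimate_comm_time(const CommTimings *t, size_t bytes)
{
    return t->startup_us + t->rate_per_byte_us * (double)bytes;
}

Under this model, packing several small messages into one, as the code generator does in Section 2.7, pays off because the start-up term is then incurred once rather than once per message.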

Program FFT
/* N: number of points for discrete Fourier transform, let N = PN x SN */
/* data[log(PN)+2][PN][SN] stores single-assigned data points for discrete */
/* Fourier transform, organized as a PN x SN grid for parallel computation */

/* main program */
call Initiation;                /* serial part; initialize the array 'data' */
/* parallel inter-multiplication of data points */
for i = log(PN) downto 1 do
    for j = 0 to PN-1 step 1<<i do
        for k = 0 to 1<<(i-1)-1 do
            call InterMult(data[i+1][j+k], data[i+1][j+k+1<<(i-1)], data[i][j+k], SN);
            call InterMult(data[i+1][j+k+1<<(i-1)], data[i+1][j+k], data[i][j+k+1<<(i-1)], SN);
            /* in each iteration, InterMult can be executed if arrays */
            /* data[i+1][j+k] and data[i+1][j+k+1<<(i-1)] are available; */
            /* upon completion, data[i][j+k] and data[i][j+k+1<<(i-1)] */
            /* will be available */
        endfor
    endfor
endfor
/* parallel intra-multiplication of data points */
for i = 0 to PN-1 do
    call IntraMult(data[1][i], data[0][i], SN);
    /* in each iteration, IntraMult can be executed if array data[1][i] */
    /* is available; upon completion, data[0][i], which is the result, */
    /* will be available */
endfor
call OutputResult;              /* serial part; inverse and return results */
EndProgram FFT

/* Procedure InterMult */
Procedure InterMult(inarray1, inarray2, outarray, n)
/* Inputs: inarray1, inarray2 (data points for multiplication); */
/*         n (number of data points in sub-array) */
/* Output: outarray (array of output data) */
for i = 0 to n-1 do
    outarray[i] = inarray1[i] @ inarray2[i];
    /* '@' is the element-wise complex FFT operation */
endfor
EndProcedure InterMult

/* Procedure IntraMult */
Procedure IntraMult(inarray, outarray, n)
/* Inputs: inarray (data points for multiplication); */
/*         n (number of data points in sub-array) */
/* Output: outarray (array of output data) */
for i = log(n) downto 1 do
    for j = 0 to n-1 step 1<<i do
        for k = 0 to 1<<(i-1)-1 do
            outarray[j+k] = inarray[j+k] @ inarray[j+k+1<<(i-1)];
            outarray[j+k+1<<(i-1)] = inarray[j+k+1<<(i-1)] @ inarray[j+k];
            /* '@' is the element-wise complex FFT operation */
        endfor
    endfor
    for j = 0 to n-1 do
        inarray[j] = outarray[j];
    endfor
endfor
EndProcedure IntraMult

Figure 2: A sequential program for fast Fourier transform.

2.5 Task Scheduling and Mapping

A common approach to distributing the workload among p processors is to partition a problem into p tasks and perform a one-to-one mapping between the tasks and the processors. Partitioning can be done with the block, cyclic, or block-cyclic pattern [10]. Such partitioning schemes using simple scheduling heuristics, such as the owner-computes rule, work for certain problems but can fail for many others, especially irregular problems, since it is difficult to balance the load and minimize dependencies simultaneously. An irregular problem is instead partitioned into many tasks, which are scheduled so as to balance the load and minimize communication. In CASCH, the task graph generated from this partitioning is scheduled using a scheduling algorithm. Since one scheduling algorithm may not be suitable for every problem on a given architecture [7], CASCH includes various algorithms suitable to various environments. The advantages of having a wide variety of algorithms in CASCH are the following:

The diversity of these heuristic algorithms allows the user to select a type of algorithm that is suitable to a particular architectural configuration.

The common platform provided by CASCH allows simultaneous comparisons among various algorithms, based on a number of performance objectives such as schedule length, number of processors used, and the algorithm's running time. The comparison among the algorithms can be done using manually generated task graphs as well as real data measured during the execution of a number of applications.

For a given application program, the user can optimize the code by running various scheduling algorithms and then choosing the best schedule.

2.6 Communication Inserter

Synchronization among the tasks running on multiple processors is carried out by communication primitives. The basic communication primitives for exchanging messages between processors are send and receive. They must be used properly to ensure a correct sequence of computation. These primitives can be inserted automatically, reducing the programmer's burden and eliminating insertion errors. The procedure for inserting communication primitives is as follows. After scheduling and mapping, each task has been allocated to a processor. If an edge leaves a task for another task that belongs to a different processor, a send primitive is inserted after the task. Similarly, if an edge comes from another task on a different processor, a receive primitive is inserted before the task. This insertion method alone does not ensure a correct communication sequence, because a deadlock may occur. Thus, we use a send-first strategy to reorder the communication primitives; that is, we reorder receives according to the order of sends. The communication primitive insertion algorithm is described below, followed by an illustrative sketch in C.

Communication Insertion Algorithm: Assume that after scheduling and mapping, each task T_i of the task graph is allocated to processor M(T_i), where M is a function mapping a task number to a processor number.

(1) For each edge e_k from task T_i to T_j for which M(T_i) != M(T_j), insert a send primitive after task T_i in processor M(T_i), denoted by S(e_k, T_i, M(T_j)); insert a receive primitive before task T_j in processor M(T_j), denoted by R(e_k, T_j, M(T_i)). Once a message has been scheduled to be sent to a processor, eliminate other sends and receives that transfer the same message to the same processor. Now, for each processor, we have a sequence X(e_m1, T_m1, P_m1), X(e_m2, T_m2, P_m2), ..., where X can be either S or R.

(2) For each pair of processors, say P_1 and P_2, extract all S(e_mi, T_mi, P_2) from processor P_1 to form a subsequence S_P1, and extract all R(e_mj, T_mj, P_1) from processor P_2 to form a subsequence R_P2.

Step 2.1: Within each segment of the subsequence S_P1 with the same task number, exchange the order of sends according to the order of receives as defined by the subsequence R_P2.

Step 2.2: If the two resultant subsequences still do not match each other, reorder R_P2 according to the order of S_P1.
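As a small illustration of step (1), the sketch below scans the edges of a scheduled task graph and reports where send and receive primitives would go. It reuses the illustrative MacroDataflowGraph type from Section 2.3, prints the primitives instead of splicing them into generated code, and omits both the duplicate-message elimination of step (1) and the reordering of step (2).

#include <stdio.h>

/* Step (1) of the insertion algorithm: for every edge whose endpoint
   tasks are mapped to different processors, a send is placed after the
   source task and a receive before the destination task. */
void insert_communication(const MacroDataflowGraph *g)
{
    for (int k = 0; k < g->n_edges; k++) {
        const Edge *e = &g->edges[k];
        int src_proc = g->tasks[e->src].processor;   /* M(T_i) */
        int dst_proc = g->tasks[e->dst].processor;   /* M(T_j) */
        if (src_proc != dst_proc) {
            printf("P%d: S(e%d, T%d, P%d) after T%d\n",
                   src_proc, k, e->src, dst_proc, e->src);
            printf("P%d: R(e%d, T%d, P%d) before T%d\n",
                   dst_proc, k, e->dst, src_proc, e->dst);
        }
    }
}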

2.7 Code Generation

We use the example of Figure 2 to illustrate our code generation method. Figure 3 shows the generated parallel code for three processors (assuming N = 8). Note that only the main program for each processor is shown. The data structure is the same as in Figure 2. In this example, the initial data is stored at processor P0. Data is transmitted to the other processors such that each processor obtains the portion of data required for its computation. Consequently, the memory space is compacted. To reduce the number of message transfers and, consequently, the time to initiate messages, several messages can be packed and sent together. For example, the first four messages can be packed into one message and sent to processor P0. Such optimizations are also implemented in CASCH; an illustrative sketch of message packing follows Figure 3. Finally, the fourth data partition of the result is received from processor P0, the third from processor P1, and the first and the second from processor P2.

/* For P0 */
/* load array of data points from HOST */
receive(HOST, data[3][0]);
receive(HOST, data[3][1]);
receive(HOST, data[3][2]);
receive(HOST, data[3][3]);
InterMult(data[3][3], data[3][1], data[2][3], 2);
send(P1, data[2][3]);
InterMult(data[3][1], data[3][3], data[2][1], 2);
InterMult(data[3][2], data[3][0], data[2][2], 2);
send(P1, data[2][2]);
InterMult(data[3][0], data[3][2], data[2][0], 2);
InterMult(data[2][1], data[2][0], data[1][1], 2);
send(P2, data[1][1]);
InterMult(data[2][0], data[2][1], data[1][0], 2);
send(P2, data[1][0]);
InterMult(data[2][3], data[2][2], data[1][3], 2);
IntraMult(data[1][3], data[0][3], 2);
/* unload result array of data points to HOST */
send(HOST, data[0][3]);

/* For P1 */
receive(P0, data[2][2]);
receive(P0, data[2][3]);
InterMult(data[2][2], data[2][3], data[1][2], 2);
IntraMult(data[1][2], data[0][2], 2);
/* unload result array of data points to HOST */
send(HOST, data[0][2]);

/* For P2 */
receive(P0, data[1][1]);
IntraMult(data[1][1], data[0][1], 2);
receive(P0, data[1][0]);
IntraMult(data[1][0], data[0][0], 2);
/* unload result array of data points to HOST */
send(HOST, data[0][1]);
send(HOST, data[0][0]);

Figure 3: The parallel code for fast Fourier transform.
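The message-packing optimization mentioned above can be pictured with the following sketch; the buffer type and function names are ours, not the code generator's.

#include <string.h>

/* Pack several small messages into one buffer so that the per-message
   start-up cost (Section 2.4) is paid only once. */
typedef struct {
    char buf[4096];   /* packed payload */
    int  used;        /* bytes currently packed */
} MsgPack;

/* Append one message to the pack; returns 0 if the buffer is full. */
int pack_msg(MsgPack *p, const void *data, int bytes)
{
    if (bytes < 0 || p->used + bytes > (int)sizeof(p->buf))
        return 0;
    memcpy(p->buf + p->used, data, bytes);
    p->used += bytes;
    return 1;
}

The packed buffer is then shipped with a single send and unpacked in the same order on the receiving side, so the four HOST-to-P0 messages of Figure 3, for example, would incur one start-up cost instead of four.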

2.8 Graphical User Interface

The graphical capabilities of CASCH provide the user with an easy-to-use, window-based interactive interface. The graphical interface includes the following facilities, which map to the buttons shown in Figure 4.

Figure 4: The main menu of CASCH.

Source: The user can create, edit, or browse through sequential programs. The Source button also includes a sub-menu for generating a task graph from the user program.

DAGs: This includes a facility to display a task graph (i.e., a DAG) generated from the user program (Figure 5 shows the DAG for the FFT program). Other options include the display of a randomly generated DAG or the interactive creation of a DAG. Zooming facilities (horizontal, vertical, or both) are included for proper viewing.

Figure 5: Display of the DAG for the FFT program.

TIGs: This facility displays task graphs that are TIGs (task interacting graphs, with undirected edges). This facility is similar in functionality to DAGs.

Processor Network: This facility allows the user to display a processor architecture (including the processors and the network topology). The editing facilities, similar to those for DAGs, allow the user to interactively create various network topologies. An example processor graph is illustrated in Figure 6.

Scheduling: This facility includes a sub-menu from which the user must first select one of the three classes of scheduling algorithms, i.e., BNP, UNC, and APN (for definitions of these terms, see Sidebar 1, "Recent Research in Multiprocessor Scheduling: A Brief Introduction"). Within each class, the user then selects one of the scheduling algorithms; the chosen algorithm requires the user to enter a number of parameters.

Figure 6: Display of processor graph.

Show Schedule: The schedule generated as the result of invoking a scheduling algorithm can be displayed using this facility (Figure 7 and Figure 8 show the schedules of the FFT example produced by five different algorithms). A schedule is displayed using a Gantt chart showing the start and finish times of tasks on the various processors. Clicking on any task in the Gantt chart displays its start and finish times; the total schedule length is shown in the right corner of the window. (In Figure 7, tasks 1 and 14 appear as thin rectangles due to their small weights.) A schedule also includes communication messages on the network, displayed through another window which is invoked by clicking on any two processors. An important feature of this facility is the trace option, which shows a step-by-step scheduling of each task. This is very useful for understanding the operation of a scheduling algorithm through observation of the order in which tasks are scheduled. Multiple such charts can be opened concurrently, allowing a comparison among the schedules generated by various algorithms. Indeed, in most cases, it may be necessary to try different algorithms. Two additional scheduling examples are depicted in Figure 9 and Figure 10.

Mapping: This set of facilities includes a number of mapping algorithms that are used to map TIGs onto the processors. At present, CASCH includes algorithms based on A*, recursive clustering, and simulated annealing [9]. Some scheduling algorithms (such as the UNC algorithms) may first generate clusters that need to be mapped onto the processors using one of these mapping algorithms. (A separate mapping algorithm is required when scheduling and mapping are done in separate steps.)

Show Mapping: This shows an assignment of tasks to the processors generated by a mapping algorithm. The display includes a TIG in which a processor number is attached to each task, indicating the processor to which that task is allocated.

Code Generation: The code generator generates the parallel code for a given program according to a schedule/map generated by a scheduling/mapping algorithm.

Performance: The performance facilities include processor utilization, time spent in computation and communication, and speedup. The computation and communication timing results are obtained by inserting the dclock() procedure call before and after each inter-task communication, as sketched below.
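A minimal sketch of this instrumentation pattern is shown below, assuming that dclock() returns wall-clock time in seconds, as on the Paragon; the wrapper and accumulator names are our own, and the actual generated code may differ.

/* dclock() is the target machine's wall-clock timer; send_msg() stands
   for the generated message-passing primitive. Both are declared here
   only to keep the sketch self-contained. */
extern double dclock(void);
extern void   send_msg(int dest, void *buf, int bytes);

static double comm_time = 0.0;   /* accumulated communication time */

/* Illustrative wrapper: bracket one inter-task communication with
   timer reads and accumulate the elapsed time. */
void timed_send(int dest, void *buf, int bytes)
{
    double t0 = dclock();
    send_msg(dest, buf, bytes);
    comm_time += dclock() - t0;
}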

Figure 7: Display of the Gantt charts showing the schedules generated by three UNC algorithms for the FFT program: (a) the schedule generated by the DCP algorithm (schedule length = 71); (b) the schedule generated by the DSC algorithm (schedule length = 68); (c) the schedule generated by the MD algorithm (schedule length = 78).

Figure 8: Display of the Gantt chart showing the schedules generated by the MCP and ETF algorithms for the FFT program (schedule length = 83).

Figure 9: A Gaussian elimination task graph.

Data Partitioning: This facility includes tools for displaying structured and unstructured meshes, as well as the partitioning of data across different processors.

Figure 10: The schedules for the Gaussian elimination graph (Figure 9) produced by two scheduling algorithms: (a) the schedule generated by the DCP algorithm (schedule length = 439); (b) the schedule generated by the MCP and ETF algorithms (schedule length = 431).

3 Results

CASCH runs on a Sun workstation that is linked through a network to an Intel Paragon and an IBM SP2. We have parallelized several applications with CASCH using several of the scheduling algorithms described above. Here we discuss some preliminary results obtained by measuring the performance of three applications: FFT, a Laplace equation solver, and the N-body problem. These results demonstrate the viability and usefulness of CASCH, and also allow a comparison among the various scheduling algorithms. For reference, the results obtained with code generated by random scheduling of tasks are included. Such target code is generated by first partitioning the data among processors in a fashion that reduces the dependencies among the partitions. Based on this partitioning, an SPMD-based code is generated by randomly allocating the tasks to the processors. Hereafter, randsch is used to denote the results of these randomly scheduled programs.

The first set of results (see Table 2) is for the FFT example shown earlier, with four different sizes of input data: 512, 1024, 2048, and 4096 points.

Table 2 shows the execution times for the various data sizes using the different scheduling algorithms. Each value is an average of ten runs on the Intel Paragon and IBM SP2. The Paragon consists of 140 i860/XP processors with a clock speed of 50 MHz, while the SP2 consists of IBM P2SC processors.

[Table 2: Execution times of the FFT application for all the scheduling algorithms (UNC, BNP, and APN, plus randsch) on the Intel Paragon and IBM SP2, for 512, 1024, 2048, and 4096 points.]

We observe that the execution times vary considerably across the algorithms. Among the UNC algorithms, the DCP algorithm yields the best performance due to its superior scheduling method; it also yields the best performance overall. Among the BNP algorithms, MCP and DLS are in general better, primarily because of their better task-priority assignment methods. Among the APN algorithms, BSA and MH perform better, due to their proper allocation of tasks and messages. All algorithms perform better than randsch: compared to random scheduling, the level of performance improvement is up to 400%.

Our second application is based on a Gauss-Seidel algorithm for solving Laplace equations. The four matrix sizes used are 8, 16, 32, and 64. The application execution times using the various algorithms and data sizes are shown in Table 3. Again, using the DCP algorithm, more than 400% improvement over randsch is obtained. The UNC algorithms in general yield better schedules.

The third application is the N-body problem. The execution time results are shown in Table 4. The scheduling algorithms demonstrate a similar trend in application execution times on both parallel machines as in the previous two applications.

The running times of the scheduling algorithms for the three applications are shown in Table 5. Here, we can see that some scheduling algorithms take much longer than others due to their higher complexities (for details about the complexities of the algorithms, the reader is referred to [7]). Thus, there is a trade-off between the performance and the speed of a scheduling algorithm. For example, the DCP algorithm can generate better solutions than the DSC algorithm, but it is slower.

[Table 3: Execution times of the Laplace equation solver application for all the scheduling algorithms (UNC, BNP, and APN, plus randsch) on the Intel Paragon and IBM SP2, for matrix dimensions 8, 16, 32, and 64.]

[Table 4: Execution times of the N-body application for all the scheduling algorithms on the Intel Paragon and IBM SP2, organized as in Table 2 for four problem sizes.]

One important point to note from the above preliminary experimental results is that the performance of the scheduling algorithms can vary substantially. For instance, even though the average performance of the DCP algorithm is the best overall, it does perform worse in some cases. Thus, the user may have to try different schedulers to obtain the best results for the application at hand.

[Table 5: Scheduling times (seconds) on a SPARCstation 2 for all the scheduling algorithms: (a) the FFT application, by number of points; (b) the Laplace equation solver application, by matrix dimension; (c) the N-body application, by number of points. Each part lists the UNC algorithms (DCP, DSC, EZ, LC, MD), the BNP algorithms (ETF, HLFET, ISH, LAST, MCP), and the APN algorithms (BSA, BU, MH).]

4 Conclusions and Future Work

CASCH achieves the objectives of automatic parallelization and scheduling of applications by providing a unified environment for various existing and conceptual machines. The combination of a parallel code generator and a scheduler allows new scheduling ideas to be tested with real applications instead of just simulations. CASCH also provides a framework for users to compare various scheduling algorithms and to optimize their code by choosing the best algorithm. As shown by the experimental results, even an effective scheduling heuristic can sometimes produce inferior solutions.

The extensive scheduling algorithm library in CASCH, which includes various state-of-the-art scheduling algorithms, allows the user to optimize the execution of a parallel application by choosing the best schedule with the help of the interactive scheduling interface.

We are currently working on extending the capabilities of CASCH with the following:

support for distributed computing systems, such as a collection of diverse machines working as a distributed heterogeneous supercomputer;
extension of the current database of benchmark timings with more detailed and lower-level timings of various computation, communication, and I/O operations on various existing machines;
inclusion of debugging facilities for error detection, global variable checking, etc.;
design and implementation of partitioners for automatic or interactive partitioning of programs;
design of an intelligent tool that will select an appropriate scheduling algorithm for a given application; and
enhancement of the task graph module so that huge task graphs (e.g., for the Laplace equation with a larger matrix size) can be handled; in this regard, the parameterized task graph (PTG) technique proposed by Cosnard and Loi [3] is being considered for implementation.

Acknowledgments

The authors would like to thank the referees for their constructive and insightful comments, which have greatly improved the presentation of this paper. Preliminary versions of portions of this paper were presented at the 1997 International Conference on Parallel Processing and the 3rd European Conference on Parallel Processing. This research was supported by the Hong Kong Research Grants Council under contract numbers HKUST RI 93/94.EG06 and HKUST734/97E.

References

[1] E.G. Coffman, Computer and Job-Shop Scheduling Theory, Wiley, New York, 1976.
[2] W.W. Chu, M.-T. Lan, and J. Hellerstein, "Estimation of Intermodule Communication (IMC) and Its Applications in Distributed Processing Systems," IEEE Trans. Computers, vol. C-33, no. 8, Aug. 1984.
[3] M. Cosnard and M. Loi, "Automatic Task Graphs Generation Techniques," Parallel Processing Letters, vol. 5, no. 4, Dec. 1995.
[4] A. Ghafoor and J. Yang, "A Distributed Heterogeneous Supercomputing Management System," IEEE Computer, vol. 26, no. 6, June 1993.
[5] S.J. Kim and J.C. Browne, "A General Approach to Mapping of Parallel Computation upon Multiprocessor Architectures," Proc. Int'l Conf. Parallel Processing, vol. II, pp. 1-8, Aug. 1988.
[6] Y.-K. Kwok and I. Ahmad, "Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs to Multiprocessors," IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 5, May 1996.
[7] Y.-K. Kwok and I. Ahmad, "Benchmarking the Task Graph Scheduling Algorithms," Proc. Int'l Parallel Processing Symposium, Apr. 1998.
[8] R.E. Lord, J.S. Kowalik, and S.P. Kumar, "Solving Linear Algebraic Equations on an MIMD Computer," J. ACM, vol. 30, no. 1, Jan. 1983.
[9] T.M. Nabhan and A.Y. Zomaya, "A Parallel Simulated Annealing Algorithm with Low Communication Overhead," IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 12, Dec. 1995.

[10] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, MIT Press, Cambridge, MA, 1989.
[11] Z. Xu and K. Hwang, "Modeling Communication Overhead: MPI and MPL Performance on the IBM SP2," IEEE Parallel and Distributed Technology, vol. 4, no. 1, pp. 9-23, Spring 1996.

Sidebar 1: Recent Research in Multiprocessor Scheduling: A Brief Introduction

The most common models of a parallel program are the precedence-constrained directed acyclic graph (DAG) and the task interacting graph (TIG), which has no temporal dependencies. Figure 11(a) shows a parallel loop nest and Figure 11(b) depicts the DAG representing the loop.

(1) Data[0,j] := 0, for all j
(3) for i = 1 to 3 do in parallel
(4)     for j = 1 to 3-(i-1) do in parallel
(5)         Task_ij (INPUT: Data[i-1,j], Data[i-1,j+1]; OUTPUT: Data[i,j])

Figure 11: (a) A parallel program fragment and (b) a directed acyclic graph representing the program fragment. The DAG consists of the tasks Task_11, Task_12, Task_13, Task_21, Task_22, and Task_31, connected through the data items Data_11, Data_12, Data_13, Data_21, and Data_22.

The weight associated with a node represents the execution time of the corresponding task, and the weight associated with an edge represents the communication time. Numerous techniques have been proposed in the literature for generating the node and edge weights off-line, such as execution profiling and analytical benchmarking [5]. With such a static model, a scheduler is invoked off-line at compile-time. This form of the multiprocessor scheduling problem is called static scheduling or DAG scheduling.

Figure 12 provides a taxonomy of static parallel scheduling algorithms. The taxonomy is partial, since it does not include details of some of the earlier work on scheduling; only those scheduling algorithms that can be used in a realistic environment and are relevant in our context are considered. The taxonomy is hierarchical and develops by expanding each layer. Thick arrows indicate relevance to our discussion and a further division of a particular layer; thin arrows do not lead to a further division. The highest level of the taxonomy is divided into two categories, depending upon whether the tasks are independent or not; this discussion is limited to dependent tasks. Earlier algorithms make simplifying assumptions about the task graph representing the program and the model of the multiprocessor system. Some algorithms ignore the precedence constraints and consider the task graph to be free of temporal dependencies (a task interacting graph).

The algorithms considering the more realistic task precedence-constrained graph assume the graph to have a special structure, such as a tree or fork-join graph. In general, however, parallel programs come in a variety of structures. The algorithms designed to tackle arbitrary graph structures can be divided further into two categories: some assume the computational costs of all tasks to be the same, while others allow arbitrary computational costs. It is worth mentioning that the scheduling problem is NP-complete even in two simple cases: (1) scheduling unit-time tasks on an arbitrary number of processors, and (2) scheduling tasks of one or two time units on two processors.

Scheduling with communication may be done with or without duplication of tasks [3]. Each class can be further divided into two categories; note that only the division of the No-Duplication class is shown. An analogous division of the Duplication class can be envisaged but is not shown here, due to its similarity with the No-Duplication class. Some scheduling algorithms assume the availability of an unlimited number of processors [2], [6], [7], [9] with a fully connected network; these are called the UNC (unbounded number of clusters) scheduling algorithms. The algorithms assuming a limited number of processors are called the BNP (bounded number of processors) scheduling algorithms. In the UNC and BNP scheduling algorithms, the processors are assumed to be fully connected, and link contention and the routing strategies used for communication are ignored. If scheduling and mapping are done in separate steps, the schedules generated by the UNC or BNP algorithms can be mapped onto the processors using the indirect mapping approach. The algorithms that assume the system to have an arbitrary network topology are called the APN (arbitrary processor network) scheduling algorithms [8].

The basis of a major component of scheduling algorithms (in all three classes) is the classical list scheduling approach [1], [4]. In list scheduling, the tasks are assigned priorities and placed in a ready list arranged in descending order of priority. A task with a higher priority is examined for scheduling before a task with a lower priority; if more than one task has the same priority, ties are broken using some method. A task selected for scheduling is allocated to the processor that allows the earliest start time. After a task is scheduled, more tasks may be added to the ready list. The tasks in the ready list are then examined and scheduled in the same fashion, and this continues until all tasks are scheduled. A minimal sketch of this loop in C appears at the end of this sidebar.

The scheduling algorithm library of CASCH includes five UNC, six BNP, and four APN algorithms. The major characteristics of these algorithms are briefly described below. The reader is referred to [2] for a more detailed description and comparison.

References for Sidebar 1:

[1] T.L. Adam, K.M. Chandy, and J. Dickson, "A Comparison of List Scheduling for Parallel Processing Systems," Comm. ACM, vol. 17, no. 12, Dec. 1974.
[2] I. Ahmad, Y.-K. Kwok, and M.-Y. Wu, "Analysis, Evaluation, and Comparison of Algorithms for Scheduling Task Graphs on Parallel Processors," Proc. 2nd Int'l Symposium on Parallel Architectures, Algorithms, and Networks, June 1996.
[3] I. Ahmad and Y.-K. Kwok, "On Exploiting Task Duplication in Parallel Program Scheduling," IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 9, Sept. 1998.

Static Parallel Scheduling
    Independent Tasks
    Multiple Interacting Tasks
        Task Interaction Graph
        Task Precedence Graph
            Restricted Graph Structure
            Arbitrary Graph Structure
                Unit Computational Costs
                Arbitrary Computational Costs
                    No Communication
                    With Communication
                        Duplication
                        No Duplication
                            Unlimited Number of Processors: UNC Algorithms
                            Limited Number of Processors
                                Processors Fully Connected: BNP Algorithms
                                Processors Arbitrarily Connected: APN Algorithms

Figure 12: A partial taxonomy of the multiprocessor scheduling problem.

[4] H. El-Rewini and T.G. Lewis, "Scheduling Parallel Programs onto Arbitrary Target Machines," J. Parallel and Distributed Computing, vol. 9, no. 2, June 1990.
[5] K. Hwang, Z. Xu, and M. Arakawa, "Benchmark Evaluation of the IBM SP2 for Parallel Signal Processing," IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 5, May 1996.
[6] S.J. Kim and J.C. Browne, "A General Approach to Mapping of Parallel Computation upon Multiprocessor Architectures," Proc. Int'l Conf. Parallel Processing, vol. II, pp. 1-8, Aug. 1988.
[7] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, MIT Press, Cambridge, MA, 1989.
[8] G.C. Sih and E.A. Lee, "A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures," IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 2, Feb. 1993.
[9] T. Yang and A. Gerasoulis, "DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 9, Sept. 1994.
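To close this sidebar, here is the minimal, self-contained sketch of the classical list scheduling loop in C promised above. The five-task DAG, the ready-made priorities, and all the names are a toy example of our own rather than any of the CASCH algorithms; priorities are given instead of computed, and communication costs are ignored, which corresponds to the simplest fully connected BNP setting.

#include <stdio.h>

#define NT 5   /* tasks */
#define NP 2   /* processors */

/* A tiny hard-wired DAG: edge[i][j] = 1 means T_(i+1) precedes T_(j+1).
   Priorities would normally come from a level or critical-path
   computation; here they are given directly. */
static int edge[NT][NT] = {
    {0, 1, 1, 0, 0},
    {0, 0, 0, 1, 0},
    {0, 0, 0, 0, 1},
    {0, 0, 0, 0, 1},
    {0, 0, 0, 0, 0},
};
static double priority[NT]  = {5, 4, 3, 2, 1};
static double exec_time[NT] = {2, 3, 1, 2, 4};

int main(void)
{
    double proc_free[NP] = {0.0};   /* earliest free time per processor */
    double finish[NT];
    int    done[NT] = {0};

    for (int n = 0; n < NT; n++) {
        /* select the highest-priority ready task (all predecessors done) */
        int best = -1;
        for (int i = 0; i < NT; i++) {
            if (done[i]) continue;
            int ready = 1;
            for (int p = 0; p < NT; p++)
                if (edge[p][i] && !done[p]) ready = 0;
            if (ready && (best < 0 || priority[i] > priority[best]))
                best = i;
        }

        /* data-ready time: latest finish among its predecessors
           (communication costs ignored in this sketch) */
        double drt = 0.0;
        for (int p = 0; p < NT; p++)
            if (edge[p][best] && finish[p] > drt) drt = finish[p];

        /* allocate to the processor giving the earliest start time */
        int proc = 0;
        for (int q = 1; q < NP; q++)
            if (proc_free[q] < proc_free[proc]) proc = q;
        double start = proc_free[proc] > drt ? proc_free[proc] : drt;

        finish[best]    = start + exec_time[best];
        proc_free[proc] = finish[best];
        done[best]      = 1;
        printf("T%d -> P%d  [%.1f, %.1f]\n", best + 1, proc, start, finish[best]);
    }
    return 0;
}

Each iteration selects the highest-priority ready task and the processor giving it the earliest start time; this is the skeleton that the list scheduling variants refine through different priority and start-time computations.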

Sidebar 2: Other Parallel Programming Tools

Several research efforts have demonstrated the usefulness of program development tools for parallel processing on message-passing multiprocessors. Essentially, these tools fall into two classes. The first class, which mainly comprises commercial tools, provides software development and debugging environments; the ATEXPERT by Cray Research [1] is an example. Some of these tools also provide performance tuning and other program development facilities. The second class performs program transformation through program restructuring. PARASCOPE [3] and TINY [9] are restructuring tools that automatically transform sequential programs into parallel programs. TOP/DOMDEC [2] is a tool for program partitioning. Some of the recently reported prototype scheduling tools are described below.

PAWS [6] is a performance evaluation tool that provides an interactive environment for performance evaluation of various multiprocessor systems. PAWS does not perform scheduling and mapping, and does not generate any code. It is useful only for simulating the execution of an application on various machines.

Hypertool [10] takes a user-partitioned sequential program as input and automatically allocates and schedules the partitions to processors. Proper synchronization primitives are also inserted automatically. Hypertool is a code generation tool, since the user program is compiled into a parallel program for the iPSC/2 hypercube computer using parallel code synthesis and optimization techniques. The tool also generates performance estimates, including the execution time, the communication and suspension times for each processor, and the network delay for each communication channel. Scheduling is done using the MD algorithm or the MCP algorithm.

PYRROS [8] is a compile-time scheduling and code generation tool. Its input is a task graph and the associated sequential C code; the output is a static schedule and parallel C code for the iPSC/2. PYRROS consists of a task graph language with an interface to C, a scheduling system that uses only the DSC algorithm, an X Windows based graphic displayer, and a code generator. The task graph language allows the user to define partitioned programs and data. The scheduling system is used for clustering the task graph, performing load-balanced mapping, and ordering computation and communication. The graphic displayer is used for displaying task graphs and scheduling results in the form of Gantt charts. The code generator inserts synchronization primitives and performs parallel code optimization for the target parallel machine.

Parallax [4] incorporates seven classical scheduling heuristics designed in the seventies, providing an environment for parallel program developers to find out how the schedulers affect program performance on various parallel architectures. Users must provide the input program as a task graph and estimate the task execution times. Users must also express the target machine as an interconnection topology graph. Parallax then generates schedules in the form of Gantt charts, speedup curves, and processor and communication efficiency charts, using an X Windows interface. In addition, an animated display of the simulated running program helps developers to evaluate the differences among the provided scheduling heuristics. Parallax, however, is not reported to generate executable parallel code.

OREGAMI [5] is designed for use in conjunction with parallel programming languages that support a communication model, such as OCCAM and C*, or with traditional programming languages like C and FORTRAN extended with communication facilities.
As such, it is a set of tools that includes the LaRCS compiler, which compiles textual user task descriptions into specialized task graphs called TCGs.


More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

Grid Scheduler. Grid Information Service. Local Resource Manager L l Resource Manager. Single CPU (Time Shared Allocation) (Space Shared Allocation)

Grid Scheduler. Grid Information Service. Local Resource Manager L l Resource Manager. Single CPU (Time Shared Allocation) (Space Shared Allocation) Scheduling on the Grid 1 2 Grid Scheduling Architecture User Application Grid Scheduler Grid Information Service Local Resource Manager Local Resource Manager Local L l Resource Manager 2100 2100 2100

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

Memory Hierarchy Management for Iterative Graph Structures

Memory Hierarchy Management for Iterative Graph Structures Memory Hierarchy Management for Iterative Graph Structures Ibraheem Al-Furaih y Syracuse University Sanjay Ranka University of Florida Abstract The increasing gap in processor and memory speeds has forced

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

An Effective Load Balancing Task Allocation Algorithm using Task Clustering

An Effective Load Balancing Task Allocation Algorithm using Task Clustering An Effective Load Balancing Task Allocation using Task Clustering Poornima Bhardwaj Research Scholar, Department of Computer Science Gurukul Kangri Vishwavidyalaya,Haridwar, India Vinod Kumar, Ph.d Professor

More information

LIST BASED SCHEDULING ALGORITHM FOR HETEROGENEOUS SYSYTEM

LIST BASED SCHEDULING ALGORITHM FOR HETEROGENEOUS SYSYTEM LIST BASED SCHEDULING ALGORITHM FOR HETEROGENEOUS SYSYTEM C. Subramanian 1, N.Rajkumar 2, S. Karthikeyan 3, Vinothkumar 4 1 Assoc.Professor, Department of Computer Applications, Dr. MGR Educational and

More information

QoS-constrained List Scheduling Heuristics for Parallel Applications on Grids

QoS-constrained List Scheduling Heuristics for Parallel Applications on Grids 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing QoS-constrained List Scheduling Heuristics for Parallel Applications on Grids Ranieri Baraglia, Renato Ferrini, Nicola Tonellotto

More information

Speed-up of Parallel Processing of Divisible Loads on k-dimensional Meshes and Tori

Speed-up of Parallel Processing of Divisible Loads on k-dimensional Meshes and Tori The Computer Journal, 46(6, c British Computer Society 2003; all rights reserved Speed-up of Parallel Processing of Divisible Loads on k-dimensional Meshes Tori KEQIN LI Department of Computer Science,

More information

A Large-Grain Parallel Programming Environment for Non-Programmers

A Large-Grain Parallel Programming Environment for Non-Programmers Calhoun: The NPS Institutional Archive Faculty and Researcher Publications Faculty and Researcher Publications 1994 A Large-Grain Parallel Programming Environment for Non-Programmers Lewis, Ted IEEE http://hdl.handle.net/10945/41304

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Case Studies on Cache Performance and Optimization of Programs with Unit Strides

Case Studies on Cache Performance and Optimization of Programs with Unit Strides SOFTWARE PRACTICE AND EXPERIENCE, VOL. 27(2), 167 172 (FEBRUARY 1997) Case Studies on Cache Performance and Optimization of Programs with Unit Strides pei-chi wu and kuo-chan huang Department of Computer

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Task Allocation for Minimizing Programs Completion Time in Multicomputer Systems

Task Allocation for Minimizing Programs Completion Time in Multicomputer Systems Task Allocation for Minimizing Programs Completion Time in Multicomputer Systems Gamal Attiya and Yskandar Hamam Groupe ESIEE Paris, Lab. A 2 SI Cité Descartes, BP 99, 93162 Noisy-Le-Grand, FRANCE {attiyag,hamamy}@esiee.fr

More information

y(b)-- Y[a,b]y(a). EQUATIONS ON AN INTEL HYPERCUBE*

y(b)-- Y[a,b]y(a). EQUATIONS ON AN INTEL HYPERCUBE* SIAM J. ScI. STAT. COMPUT. Vol. 12, No. 6, pp. 1480-1485, November 1991 ()1991 Society for Industrial and Applied Mathematics 015 SOLUTION OF LINEAR SYSTEMS OF ORDINARY DIFFERENTIAL EQUATIONS ON AN INTEL

More information

Transactions on Information and Communications Technologies vol 3, 1993 WIT Press, ISSN

Transactions on Information and Communications Technologies vol 3, 1993 WIT Press,   ISSN Toward an automatic mapping of DSP algorithms onto parallel processors M. Razaz, K.A. Marlow University of East Anglia, School of Information Systems, Norwich, UK ABSTRACT With ever increasing computational

More information

HETEROGENEOUS COMPUTING

HETEROGENEOUS COMPUTING HETEROGENEOUS COMPUTING Shoukat Ali, Tracy D. Braun, Howard Jay Siegel, and Anthony A. Maciejewski School of Electrical and Computer Engineering, Purdue University Heterogeneous computing is a set of techniques

More information

AN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1

AN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1 AN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1 Virgil Andronache Richard P. Simpson Nelson L. Passos Department of Computer Science Midwestern State University

More information

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization ECE669: Parallel Computer Architecture Fall 2 Handout #2 Homework # 2 Due: October 6 Programming Multiprocessors: Parallelism, Communication, and Synchronization 1 Introduction When developing multiprocessor

More information

Online Optimization of VM Deployment in IaaS Cloud

Online Optimization of VM Deployment in IaaS Cloud Online Optimization of VM Deployment in IaaS Cloud Pei Fan, Zhenbang Chen, Ji Wang School of Computer Science National University of Defense Technology Changsha, 4173, P.R.China {peifan,zbchen}@nudt.edu.cn,

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

Parallel Global Routing Algorithms for Standard Cells

Parallel Global Routing Algorithms for Standard Cells Parallel Global Routing Algorithms for Standard Cells Zhaoyun Xing Computer and Systems Research Laboratory University of Illinois Urbana, IL 61801 xing@crhc.uiuc.edu Prithviraj Banerjee Center for Parallel

More information

Optimal Architectures for Massively Parallel Implementation of Hard. Real-time Beamformers

Optimal Architectures for Massively Parallel Implementation of Hard. Real-time Beamformers Optimal Architectures for Massively Parallel Implementation of Hard Real-time Beamformers Final Report Thomas Holme and Karen P. Watkins 8 May 1998 EE 382C Embedded Software Systems Prof. Brian Evans 1

More information

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications New Optimal Load Allocation for Scheduling Divisible Data Grid Applications M. Othman, M. Abdullah, H. Ibrahim, and S. Subramaniam Department of Communication Technology and Network, University Putra Malaysia,

More information

Parallel Algorithm Design. Parallel Algorithm Design p. 1

Parallel Algorithm Design. Parallel Algorithm Design p. 1 Parallel Algorithm Design Parallel Algorithm Design p. 1 Overview Chapter 3 from Michael J. Quinn, Parallel Programming in C with MPI and OpenMP Another resource: http://www.mcs.anl.gov/ itf/dbpp/text/node14.html

More information

A Dynamic Load Balancing Algorithm for eterogeneous Computing Environment

A Dynamic Load Balancing Algorithm for eterogeneous Computing Environment A Dynamic Load Balancing Algorithm for eterogeneous Computing Environment Piyush Maheshwari School of Computing and Information Technology Griffith University, Brisbane, Queensland, Australia 4 111 Abstract

More information

Evaluation of a Semi-Static Approach to Mapping Dynamic Iterative Tasks onto Heterogeneous Computing Systems

Evaluation of a Semi-Static Approach to Mapping Dynamic Iterative Tasks onto Heterogeneous Computing Systems Evaluation of a Semi-Static Approach to Mapping Dynamic Iterative Tasks onto Heterogeneous Computing Systems YU-KWONG KWOK 1,ANTHONY A. MACIEJEWSKI 2,HOWARD JAY SIEGEL 2, ARIF GHAFOOR 2, AND ISHFAQ AHMAD

More information

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

Scheduling Using Multi Objective Genetic Algorithm

Scheduling Using Multi Objective Genetic Algorithm IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 3, Ver. II (May Jun. 2015), PP 73-78 www.iosrjournals.org Scheduling Using Multi Objective Genetic

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Information Discovery, Extraction and Integration for the Hidden Web

Information Discovery, Extraction and Integration for the Hidden Web Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk

More information

Controlled duplication for scheduling real-time precedence tasks on heterogeneous multiprocessors

Controlled duplication for scheduling real-time precedence tasks on heterogeneous multiprocessors Controlled duplication for scheduling real-time precedence tasks on heterogeneous multiprocessors Jagpreet Singh* and Nitin Auluck Department of Computer Science & Engineering Indian Institute of Technology,

More information

An Experimental Investigation into the Rank Function of the Heterogeneous Earliest Finish Time Scheduling Algorithm

An Experimental Investigation into the Rank Function of the Heterogeneous Earliest Finish Time Scheduling Algorithm An Experimental Investigation into the Rank Function of the Heterogeneous Earliest Finish Time Scheduling Algorithm Henan Zhao and Rizos Sakellariou Department of Computer Science, University of Manchester,

More information

An Experimental Assessment of Express Parallel Programming Environment

An Experimental Assessment of Express Parallel Programming Environment An Experimental Assessment of Express Parallel Programming Environment Abstract shfaq Ahmad*, Min-You Wu**, Jaehyung Yang*** and Arif Ghafoor*** *Hong Kong University of Science and Technology, Hong Kong

More information

SCHEDULING AND LOAD SHARING IN MOBILE COMPUTING USING TICKETS

SCHEDULING AND LOAD SHARING IN MOBILE COMPUTING USING TICKETS Baskiyar, S. and Meghanathan, N., Scheduling and load balancing in mobile computing using tickets, Proc. 39th SE-ACM Conference, Athens, GA, 2001. SCHEDULING AND LOAD SHARING IN MOBILE COMPUTING USING

More information

A Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors

A Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors A Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors Dr. Gurvinder Singh Department of Computer Science & Engineering, Guru Nanak Dev University, Amritsar- 143001,

More information

Characterizing Home Pages 1

Characterizing Home Pages 1 Characterizing Home Pages 1 Xubin He and Qing Yang Dept. of Electrical and Computer Engineering University of Rhode Island Kingston, RI 881, USA Abstract Home pages are very important for any successful

More information

Job Re-Packing for Enhancing the Performance of Gang Scheduling

Job Re-Packing for Enhancing the Performance of Gang Scheduling Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT

More information

Engineering Drawings Recognition Using a Case-based Approach

Engineering Drawings Recognition Using a Case-based Approach Engineering Drawings Recognition Using a Case-based Approach Luo Yan Department of Computer Science City University of Hong Kong luoyan@cs.cityu.edu.hk Liu Wenyin Department of Computer Science City University

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Integrating MRPSOC with multigrain parallelism for improvement of performance

Integrating MRPSOC with multigrain parallelism for improvement of performance Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,

More information

Layer-Based Scheduling Algorithms for Multiprocessor-Tasks with Precedence Constraints

Layer-Based Scheduling Algorithms for Multiprocessor-Tasks with Precedence Constraints Layer-Based Scheduling Algorithms for Multiprocessor-Tasks with Precedence Constraints Jörg Dümmler, Raphael Kunis, and Gudula Rünger Chemnitz University of Technology, Department of Computer Science,

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

1.2 Numerical Solutions of Flow Problems

1.2 Numerical Solutions of Flow Problems 1.2 Numerical Solutions of Flow Problems DIFFERENTIAL EQUATIONS OF MOTION FOR A SIMPLIFIED FLOW PROBLEM Continuity equation for incompressible flow: 0 Momentum (Navier-Stokes) equations for a Newtonian

More information

Parallel Query Processing and Edge Ranking of Graphs

Parallel Query Processing and Edge Ranking of Graphs Parallel Query Processing and Edge Ranking of Graphs Dariusz Dereniowski, Marek Kubale Department of Algorithms and System Modeling, Gdańsk University of Technology, Poland, {deren,kubale}@eti.pg.gda.pl

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

LOW-DENSITY PARITY-CHECK (LDPC) codes [1] can

LOW-DENSITY PARITY-CHECK (LDPC) codes [1] can 208 IEEE TRANSACTIONS ON MAGNETICS, VOL 42, NO 2, FEBRUARY 2006 Structured LDPC Codes for High-Density Recording: Large Girth and Low Error Floor J Lu and J M F Moura Department of Electrical and Computer

More information

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin

More information

As computer networks and sequential computers advance,

As computer networks and sequential computers advance, Complex Distributed Systems Muhammad Kafil and Ishfaq Ahmad The Hong Kong University of Science and Technology A distributed system comprising networked heterogeneous processors requires efficient task-to-processor

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 23 Parallel Compilation Parallel Compilation Two approaches to compilation Parallelize a program manually Sequential code converted to parallel code Develop

More information

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃLPHÃIRUÃDÃSDFHLPH $GDSWLYHÃURFHVVLQJÃ$OJRULWKPÃRQÃDÃDUDOOHOÃ(PEHGGHG \VWHP Jack M. West and John K. Antonio Department of Computer Science, P.O. Box, Texas Tech University,

More information

Architectural Considerations for Network Processor Design. EE 382C Embedded Software Systems. Prof. Evans

Architectural Considerations for Network Processor Design. EE 382C Embedded Software Systems. Prof. Evans Architectural Considerations for Network Processor Design EE 382C Embedded Software Systems Prof. Evans Department of Electrical and Computer Engineering The University of Texas at Austin David N. Armstrong

More information

Modeling and Scheduling for MPEG-4 Based Video Encoder Using a Cluster of Workstations

Modeling and Scheduling for MPEG-4 Based Video Encoder Using a Cluster of Workstations Modeling and Scheduling for MPEG-4 Based Video Encoder Using a Cluster of Workstations Yong He 1,IshfaqAhmad 2, and Ming L. Liou 1 1 Department of Electrical and Electronic Engineering {eehey, eeliou}@ee.ust.hk

More information

pc++/streams: a Library for I/O on Complex Distributed Data-Structures

pc++/streams: a Library for I/O on Complex Distributed Data-Structures pc++/streams: a Library for I/O on Complex Distributed Data-Structures Jacob Gotwals Suresh Srinivas Dennis Gannon Department of Computer Science, Lindley Hall 215, Indiana University, Bloomington, IN

More information

A Robust Wipe Detection Algorithm

A Robust Wipe Detection Algorithm A Robust Wipe Detection Algorithm C. W. Ngo, T. C. Pong & R. T. Chin Department of Computer Science The Hong Kong University of Science & Technology Clear Water Bay, Kowloon, Hong Kong Email: fcwngo, tcpong,

More information

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * Hsin-Ta Chiao and Shyan-Ming Yuan Department of Computer and Information Science National Chiao Tung University

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

MODIFIED VERTICAL HANDOFF DECISION ALGORITHM FOR IMPROVING QOS METRICS IN HETEROGENEOUS NETWORKS

MODIFIED VERTICAL HANDOFF DECISION ALGORITHM FOR IMPROVING QOS METRICS IN HETEROGENEOUS NETWORKS MODIFIED VERTICAL HANDOFF DECISION ALGORITHM FOR IMPROVING QOS METRICS IN HETEROGENEOUS NETWORKS 1 V.VINOTH, 2 M.LAKSHMI 1 Research Scholar, Faculty of Computing, Department of IT, Sathyabama University,

More information

AN ABSTRACT OF THE THESIS OF. Title: Static Task Scheduling and Grain Packing in Parallel. Theodore G. Lewis

AN ABSTRACT OF THE THESIS OF. Title: Static Task Scheduling and Grain Packing in Parallel. Theodore G. Lewis AN ABSTRACT OF THE THESIS OF Boontee Kruatrachue for the degree of Doctor of Philosophy in Electrical and Computer Engineering presented on June 10. 1987. Title: Static Task Scheduling and Grain Packing

More information

Technische Universitat Munchen. Institut fur Informatik. D Munchen.

Technische Universitat Munchen. Institut fur Informatik. D Munchen. Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl

More information

DSC: Scheduling Parallel Tasks on an Unbounded Number of. Processors 3. Tao Yang and Apostolos Gerasoulis. Department of Computer Science

DSC: Scheduling Parallel Tasks on an Unbounded Number of. Processors 3. Tao Yang and Apostolos Gerasoulis. Department of Computer Science DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors 3 Tao Yang and Apostolos Gerasoulis Department of Computer Science Rutgers University New Brunswick, NJ 08903 Email: ftyang, gerasoulisg@cs.rutgers.edu

More information