Heuristic scheduling algorithms to access the critical section in Shared Memory Environment
Reda A. Ammar, Computer Science Department, University of Connecticut, Storrs, CT, USA. Ali I. El-Desouky, Computer & Control Department, Faculty of Engineering, Mansoura University, Egypt. Tahany A. Fergany, Engineering Mathematics Department, Faculty of Engineering, Cairo University, Egypt. Mohamed M. Hefeeda, Computer & Control Department, Faculty of Engineering, Mansoura University, Egypt.

Abstract

In a shared memory parallel processing environment, shared variables facilitate communication among processes. To protect shared variables from concurrent access by more than one process at a time, they are placed in a critical section. Scheduling a set of parallel processes to access this critical section so as to minimize the time spent executing these processes is a crucial problem in parallel processing. This paper presents heuristic scheduling algorithms to access this critical section.

1 Introduction

The increasing demand for faster computers has led to the availability of many parallel computers. It is hoped that currently impracticable, computationally intensive applications will become practicable through their execution on highly parallel computers. A number of factors hinder the growth of parallel computing. First, the substantial investment in sequential programming tools that aid in program testing, execution profiling, and interactive debugging. Second, the lack of a single, predominant parallel architecture. Third, the difficulty of developing efficient programs for parallel computers. This paper addresses one of the obstacles to producing efficient parallel programs: accessing the shared variables. In parallel programs, parallelism is gained through process creation.
One of the most common mechanisms proposed for the creation of processes is the FORK/JOIN mechanism [1, 9], where the FORK statement spawns several processes and the JOIN statement is used to synchronize the termination of processes. The portion of the program between the FORK and the JOIN is called the parallel structure. The semantics of the parallel structure require that exactly those processes created by the FORK operation terminate at the associated JOIN operation, and that no operations after the JOIN can start until all processes created by the FORK are completed. The cooperation of n processes to solve a problem is useful only if the partial results are efficiently exchanged between processes. Shared variables facilitate communication among the processes. But they must be protected from nondeterminism, which can result from concurrent access by more than one process at a time. In order to protect the shared variables from nondeterminism, the code that handles these variables is placed in a critical section [1, 6, 9]. The critical section is a section of code which can be executed by only one process at a time and which, once started, will be able to finish without interruption. Unfortunately, accessing the critical section by different processes creates a serial bottleneck that can seriously impair the performance of the software. Since shared memory multiprocessors are becoming more important in commercial environments, it becomes necessary to schedule shared memory access in the most efficient way. The scheduling problem [2-5, 7, 8, 10, 11] is complicated by the fact that each branch of the parallel structure resulting from the FORK operation includes the time to process the portion of the code before accessing the shared variables, the time to access the shared variables, and the time to process the portion of the code after using the shared variables, all of which may be different.
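The FORK/JOIN pattern and the lock/unlock protection of a shared variable can be illustrated with a small sketch. This is our own illustration, not code from the paper: it uses C++ threads for the FORK/JOIN and a std::mutex for the lock and unlock nodes; the function and variable names are ours.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// FORK n processes that each update a shared variable inside a
// critical section, then JOIN them. Returns the final shared value.
long run_fork_join(int n_processes, int increments_per_process) {
    long shared_sum = 0;   // the shared variable
    std::mutex cs;         // guards the critical section

    auto process_body = [&]() {
        for (int i = 0; i < increments_per_process; ++i) {
            cs.lock();     // lock node: enter the critical section
            ++shared_sum;  // access the shared variable
            cs.unlock();   // unlock node: release the critical section
        }
    };

    std::vector<std::thread> processes;            // FORK
    for (int i = 0; i < n_processes; ++i)
        processes.emplace_back(process_body);
    for (std::thread& t : processes) t.join();     // JOIN
    return shared_sum;
}
```

Without the lock/unlock pair the increments could interleave nondeterministically and updates could be lost; with them, every increment is serialized through the critical section.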
In order to make optimization possible, it is necessary to have an approach to quantify the time costs of parallel computations. The time cost of processes that require access to the critical section can then be minimized by using a suitable scheduling method. The computation structure model [9] is used to represent the detailed time cost of a parallel structure. This model assumes that the underlying computer system has a finite number of processors of the same speed which communicate with each other through a shared memory. In the computation structure model, lock nodes are used to obtain locks on shared data and unlock nodes are used for releasing these locks. These locks facilitate the protection of the shared variables.

[Fig. 1 Parallel structure model: a FORK spawns n branches; branch i consists of Bi, Locki, Si, Unlocki, and Ai; the branches meet at a JOIN.]

In Fig. 1 we have a parallel structure with n branches that are all in conflict, i.e. they all need to access the critical section simultaneously. In this parallel structure we can classify the operations into the following three categories:
1. The operation before accessing the critical section, defined as the Pre-Lock Job, PLJ.
2. The operation of accessing the critical section, which contains three sub-operations: the lock operation, which prevents other branches from accessing the critical section; the access to the shared variables; and the unlock operation, which frees the critical section for the other branches. This combined operation is defined as Lock and Access Shared Variables, LASV.
3. The operation after accessing the critical section, defined as the Remaining Job, RJ.
Algorithms were developed to schedule the access to the critical section [1, 6, 9]. A Branch and Bound algorithm [7, 8] was used to find the optimal order in which the conflicting processes access the critical section. Branch and Bound produces the optimal solution, but it may take a long time to find it, especially for a large number of processes (greater than 8). Therefore, other heuristic algorithms were suggested [2, 4] which can produce optimal or near-optimal solutions in a short time. These algorithms are called the comparison and adjustment algorithms. This paper first evaluates these algorithms by simulation and compares them. Second, it presents a new algorithm which gives better results.
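Given the three per-branch costs, the overall execution time of the structure for a particular access order can be computed by simulating the serialized critical section. The following is a minimal sketch of that computation; it is our own rendering of the model, with the Branch record as an assumption, while exec_time matches the name used later in the paper's simulation pseudo-code.

```cpp
#include <algorithm>
#include <vector>

// Time costs of one branch of the parallel structure (Fig. 1).
struct Branch { double plj, lasv, rj; };

// Overall execution time when branches access the critical section in
// the given order: every PLJ starts at the FORK (time 0), the LASVs
// are serialized by the lock, and the structure completes at the JOIN
// when the last RJ finishes.
double exec_time(const std::vector<Branch>& order) {
    double cs_free = 0.0;  // time at which the critical section becomes free
    double finish  = 0.0;  // completion time of the whole structure
    for (const Branch& b : order) {
        double enter = std::max(b.plj, cs_free);  // wait if the CS is busy
        cs_free = enter + b.lasv;                 // lock, access, unlock
        finish  = std::max(finish, cs_free + b.rj);
    }
    return finish;
}
```

For example, two branches with costs (3, 1, 1) and (1, 1, 3) finish at time 5 when the branch with the longer RJ goes first, but at time 8 in the opposite order; this is exactly the kind of difference the scheduling algorithms exploit.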
2 Previous Research Efforts

Previously [2, 4, 7, 8], algorithms were developed for accessing the critical section based on the time cost of the operations before the lock nodes, the time cost of the operations between the lock and unlock nodes, and the time cost of the operations after the unlock nodes. In the parallel structure in Fig. 1, assume that every two lock nodes are in conflict, and let:

Time cost of the Pre-Lock Job = PLJ i
Time cost of the Lock and Access Shared Variables = LASV i
Time cost of the Remaining Job = RJ i

In order to schedule the operations between the FORK/JOIN nodes (that is, the PLJs, the LASVs, and the RJs) we considered eight possible cases which may arise in the parallel structure. These cases are listed in Table 1 along with their scheduling algorithms. In Table 1, = indicates that all jobs have the same time cost, and <> indicates that at least one job has a time cost different from the others. The algorithms for cases I, II, III, IV, V, and VII were mathematically proved to give the optimal solutions [7]. For cases VI and VIII, the Branch and Bound algorithm was developed, which yields the minimum time cost for the parallel structure [7]. Although the Branch and Bound approach is a widely accepted technique [7], it is computationally expensive, especially when the problem size grows. Therefore heuristic algorithms were introduced which can produce optimal or near-optimal solutions.

Case   PLJ   LASV   RJ    Scheduling Algorithm
I      =     =      =     FCFS or LRJF
II     =     =      <>    LRJF
III    =     <>     =     FCFS or LRJF
IV     =     <>     <>    LRJF
V      <>    =      =     FCFS
VI     <>    =      <>    Branch and Bound
VII    <>    <>     =     FCFS
VIII   <>    <>     <>    Branch and Bound
FCFS: First Come First Served; LRJF: Longest Remaining Job First
Table 1 Scheduling Methods

2.1 Comparison Algorithm

A heuristic algorithm, i.e. one not mathematically proved, that finds optimal solutions in some cases and near-optimal solutions in the others. It is simple compared to the Branch and Bound algorithm and therefore takes less time. For the parallel structure in Fig. 1 with n conflicting branches, the comparison algorithm is applied as follows:
1. Use the Longest Remaining Job First, LRJF, scheduling policy to order the branches of the given parallel structure.
2. If for every i = 2, 3, ..., n, PLJ i-1 < PLJ i, then the branches already follow the First Come First Served, FCFS, policy. No additional movements are considered and the resulting order provides an optimal (or near-optimal) solution.
3. If for some i = 2, 3, ..., n, PLJ i-1 > PLJ i and PLJ i-1 - PLJ i < RJ i-1 - RJ i, reverse the order of branch i-1 and branch i.
4. Repeat step 3 until no more movements occur.

2.2 Adjustment Algorithm

The comparison algorithm is easy to apply, but another round of adjustments is needed to produce the optimal solution. The adjustment process is based upon the following two phases of movements:
1. Look for a branch that follows the current maximum branch and whose communication cost is smaller than the communication cost of a branch that precedes the current maximum branch. Swapping these two branches may reduce the execution time of the parallel structure.
2. Move the maximum branch, the branch whose execution time is the longest, to the front of the waiting queue. In this way it can access the critical section earlier and hence its execution time is reduced.
This adjustment process is iterative and continues until no more improvement is possible. The comparison algorithm is used to derive the initial solution for the adjustment algorithm.

3 The New Adjustment Algorithm

The adjustment algorithm produces optimal solutions in many cases and near-optimal solutions in the others. Yet we can add another round of enhancement, phase 3, which enhances the original adjustment algorithm and produces better results.
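Before turning to phase 3, the comparison algorithm of Sec. 2.1 can be sketched in code. This is our own C++ rendering under the timing model of Fig. 1; the Branch record and helper names are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Time costs of one branch (PLJ, LASV, RJ).
struct Branch { double plj, lasv, rj; };

// Comparison algorithm: order by Longest Remaining Job First, then
// repeatedly reverse adjacent branches i-1, i whenever
// PLJ[i-1] > PLJ[i] and PLJ[i-1] - PLJ[i] < RJ[i-1] - RJ[i].
std::vector<Branch> comparison_order(std::vector<Branch> b) {
    // Step 1: LRJF ordering (largest RJ first).
    std::sort(b.begin(), b.end(),
              [](const Branch& x, const Branch& y) { return x.rj > y.rj; });
    // Steps 3-4: local swaps until no more movements occur.
    bool moved = true;
    while (moved) {
        moved = false;
        for (std::size_t i = 1; i < b.size(); ++i) {
            bool plj_out_of_order = b[i - 1].plj > b[i].plj;
            bool small_plj_gap =
                (b[i - 1].plj - b[i].plj) < (b[i - 1].rj - b[i].rj);
            if (plj_out_of_order && small_plj_gap) {
                std::swap(b[i - 1], b[i]);
                moved = true;
            }
        }
    }
    return b;
}
```

Each accepted swap moves a branch with a larger PLJ later in the queue, so the number of PLJ inversions strictly decreases and the loop terminates.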
Phase 3 states that: moving the longest waiting branch, the branch that finishes its PLJ operation and waits the longest time to access the critical section, to the front of the waiting queue of the critical section may reduce the overall execution time. Thus, the new algorithm consists of the following three steps:
1. Look for a branch that follows the current maximum branch and whose communication cost is smaller than the communication cost of a branch that precedes the current maximum branch. Swapping these two branches may reduce the execution time of the parallel structure.
2. Move the maximum branch, the branch whose execution time is the longest, to the front of the waiting queue. In this way it can access the critical section earlier and hence its execution time is reduced.
3. Move the longest waiting branch, the branch that finishes its PLJ operation and waits the longest time to access the critical section, to the front of the waiting queue of the critical section. Thus, it can access the critical section earlier, which reduces its execution time and the overall execution time.
Simulation results (see Section 4) showed that applying the new algorithm with the order phase 1, phase 2, and finally phase 3 gave better results than the original algorithm. Moreover, when we changed the order of the phases to phase 2, phase 1, and finally phase 3, the algorithm gave much better results. Other combinations of the three phases gave worse results than the original algorithm. We tried the following combinations: (phase 2, phase 3, phase 1); (phase 2, phase 3, phase 1, phase 3); (phase 1, phase 3, phase 2, phase 3); (phase 1, phase 2, phase 3, phase 2, phase 3); all of them gave worse results.

3.1 Example

This example describes the application of the new algorithm to a parallel structure consisting of five branches; each branch has three time costs, PLJ, LASV, and RJ.
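Before the worked example, phase 2 (moving the maximum branch toward the front while it improves the overall time) can be sketched in code. This is our own interpretation of the phase, under our own names (Branch, exec_time, max_branch); a swap is kept only when it lowers the overall execution time.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Branch { double plj, lasv, rj; };

// Overall execution time for a given access order (model of Fig. 1).
double exec_time(const std::vector<Branch>& order) {
    double cs_free = 0.0, finish = 0.0;
    for (const Branch& b : order) {
        double enter = std::max(b.plj, cs_free);
        cs_free = enter + b.lasv;
        finish  = std::max(finish, cs_free + b.rj);
    }
    return finish;
}

// Index of the maximum branch: the branch whose path completes last
// (ties broken toward the back-most branch).
std::size_t max_branch(const std::vector<Branch>& order) {
    double cs_free = 0.0, worst = -1.0;
    std::size_t k = 0;
    for (std::size_t i = 0; i < order.size(); ++i) {
        double enter = std::max(order[i].plj, cs_free);
        cs_free = enter + order[i].lasv;
        double done = cs_free + order[i].rj;
        if (done >= worst) { worst = done; k = i; }
    }
    return k;
}

// Phase 2: try to move the maximum branch toward the front of the
// waiting queue, keeping only swaps that reduce the execution time.
std::vector<Branch> phase2(std::vector<Branch> b) {
    bool improved = true;
    while (improved) {
        improved = false;
        std::size_t k = max_branch(b);
        if (k == 0) break;                       // already at the front
        double current = exec_time(b);
        for (std::size_t i = 1; i <= k; ++i) {   // displacement i = 1, 2, ...
            std::vector<Branch> trial = b;
            std::swap(trial[k], trial[k - i]);   // swap branch k with branch k-i
            if (exec_time(trial) < current) {
                b = trial;
                improved = true;
                break;
            }
        }
    }
    return b;
}
```

The process terminates because every accepted swap strictly decreases the overall execution time and there are finitely many orders.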
The comparison algorithm is used to derive the initial solution. The following figure shows the application of the new algorithm.

[Figure: application of the new algorithm to the five-branch example. The initial solution lists the PLJ, LASV, and RJ values and the total time cost of each branch, with the maximum branch marked. Applying the phases swaps branch 4 with branch 3, with branch 2, and with branch 1 in turn; for phase 3, the waiting time of each branch and the longest waiting branch are shown.]
Note that in the above figure, useless steps are omitted. The new adjustment algorithm can be written in steps as follows:
1. Find the branch k of the parallel structure, after applying the comparison algorithm, whose path has the longest execution time.
2. If k = 1, then the current parallel structure has the minimum possible execution time.
3. If the execution time of the parallel structure equals the sum of the execution times of PLJ k, LASV k, and RJ k, then the scheduling order is optimal and no additional improvement is possible.
4. Apply Phase 2 as follows: a) Initialize a displacement variable i to 1. b) Swap branch k with branch k-i and evaluate the new execution times. c) If the new order has a larger overall execution time, keep the previous order, increment i, and go to step 4.b. d) Evaluate the longest path of the parallel structure with the new order. If more than one branch has the same maximum value, use the back-most one. Assume that the new maximum branch is j. e) If j = 1, or the execution time of the current parallel structure equals the sum of the execution times of PLJ j, LASV j, and RJ j, then the scheduling order is optimal and no additional improvement is possible. Otherwise, apply Phase 2 again until no more improvement is achieved.
5. Apply Phase 1 as follows: a) Set two pointers, i (the front index) and j (the back index). The front index changes from 1 to k-1 and the back index changes from k+1 to n. For every value of j, change i from 1 to k-1. If LASV i > LASV j, then swap the two branches, evaluate the execution times of the different branches, and test whether the new order is better than the previous one. b) If the new order has a larger overall execution time, keep the previous order and try another swap. c) Evaluate the longest path of the parallel structure with the new order.
Assume that the new maximum branch is branch k. d) If k = 1, or the execution time of the current parallel structure equals the sum of the execution times of PLJ k, LASV k, and RJ k, then the scheduling order is optimal and no additional improvement is possible. Otherwise, apply Phase 1 again until no more improvement is achieved.
6. Apply Phase 3 as follows: a) Find the branch w that has the maximum waiting time. The waiting time of a branch x is evaluated by subtracting the time cost of PLJ x from the time needed for the previous branch x-1 to finish the critical section. b) Initialize a displacement variable i to 1. c) Swap branch w with branch w-i and evaluate the new execution time. d) If the new order has a larger overall execution time, restore the previous order, increment i, and go to step 6.c. e) Evaluate the branch with the maximum waiting time in the new parallel structure. Assume that the new branch is branch j. f) If j = 1, or the execution time of the current parallel structure equals the sum of the execution times of PLJ j, LASV j, and RJ j, then the scheduling order is optimal and no additional improvement is possible. Otherwise, apply Phase 3 again until no more improvement is achieved.

4 Simulation Results

This section first shows the effect of scheduling the critical section on the execution of parallel programs. Second, it evaluates the scheduling algorithms and compares them.

4.1 Effect of Scheduling

To show the benefits of scheduling the access to the critical section, we developed a C++ simulation program. The program generates different numbers of branches, from 3 to 8. For each number of branches, the program generates 500 sets of random values for PLJ, LASV, and RJ. Then, for each set, it evaluates the execution time. Also, it finds the optimal order for the branches to access the critical section; this is done by trying all possible permutations, whose number equals the factorial of the number of branches.
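This exhaustive search can be sketched as follows; find_opt matches the name used in the simulation pseudo-code, while the Branch record and the timing computation are our own rendering of the model in Fig. 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

struct Branch { double plj, lasv, rj; };

// Minimum overall execution time over all access orders, found by
// trying every permutation of the branches (n! orders).
double find_opt(const std::vector<Branch>& b) {
    std::vector<std::size_t> idx(b.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    double best = std::numeric_limits<double>::max();
    do {
        double cs_free = 0.0, finish = 0.0;
        for (std::size_t i : idx) {              // serialize the LASVs
            double enter = std::max(b[i].plj, cs_free);
            cs_free = enter + b[i].lasv;
            finish  = std::max(finish, cs_free + b[i].rj);
        }
        best = std::min(best, finish);
    } while (std::next_permutation(idx.begin(), idx.end()));
    return best;
}
```

This is only feasible for small n; the paper stops at 8 branches, i.e. 40320 permutations per set.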
Then, it evaluates the optimal execution time. Eventually, it aggregates and averages the execution time and the optimal execution time over the 500 sets. The following pseudo-code describes the structure of the main body of the program.

for (branches = 3; branches <= 8; branches++) {
    total_exec_t = 0;
    total_opt_t = 0;
    for (sets = 1; sets <= 500; sets++) {
        gen_rand();              /* generate random values for PLJ, LASV, and RJ */
        exec_t = exec_time();    /* evaluate the execution time */
        total_exec_t += exec_t;
        opt_exec_t = find_opt(); /* evaluate the optimal execution time */
        total_opt_t += opt_exec_t;
    }
    average_exec_t = total_exec_t / 500;
    average_opt_t = total_opt_t / 500;
    diff = average_exec_t - average_opt_t;
}

The results produced by the program show the importance of scheduling the access to the critical section. Table 2 and Fig. 2 emphasize this fact.

[Table 2 Benefits of scheduling. Columns: number of branches; average execution time with no scheduling; average execution time with optimal scheduling; time diff. %.]

[Fig. 2 Average execution time without scheduling and with optimal scheduling.]

4.2 Algorithms Evaluation

To assess the scheduling algorithms (comparison, adjustment, and new adjustment), we developed a C++ simulation program. For an accurate evaluation, the program chooses different ranges from which the values of PLJ, LASV, and RJ are selected randomly. It starts with a LASV range that is double the range of PLJ and RJ and narrows it until the LASV range reaches only 1% of the PLJ and RJ ranges; the last case is the one likely to appear in practice. For each range, it generates different numbers of branches, from 3 to 8. For each number of branches, it generates 500 sets of random values for PLJ, LASV, and RJ. Then, for each set, it orders the branches according to the scheduling algorithm (Comparison, Adjustment, or New Adjustment) and evaluates the execution time. Then, it finds the optimal execution time by exhaustive search, i.e. trying all possible permutations (the factorial of the number of branches), to compare with. If the optimal time is not equal to the time obtained after applying the algorithm, the program counts this case as a not-optimal one and evaluates the time difference between the optimal and not-optimal times. It then aggregates the time differences from the not-optimal cases out of the overall 500 cases. After that, the program evaluates the percentage of the total time difference relative to the total optimal time. The following pseudo-code describes the structure of the main body of the program.

for (different ranges) {
    for (branches = 3; branches <= 8; branches++) {
        total_exec_t = 0;
        total_opt_t = 0;
        not_opt = 0;
        for (sets = 1; sets <= 500; sets++) {
            gen_rand();              /* random values for PLJ, LASV, and RJ within the current range */
            sched_algorithm();       /* order the branches according to the algorithm */
            exec_t = exec_time();    /* evaluate the execution time */
            total_exec_t += exec_t;
            opt_exec_t = find_opt(); /* evaluate the optimal execution time */
            if (exec_t > opt_exec_t) not_opt++;
            total_opt_t += opt_exec_t;
        }
        t_diff = total_exec_t - total_opt_t;
        t_diff_percent = (t_diff / total_opt_t) * 100;
        print(branches, not_opt, t_diff_percent);
    }
}

The results are shown below, in the tables and the figures, with the following notation: Not. opt. is the number of not-optimal cases, out of 500, obtained after applying the algorithm; Time diff. % is the percentage of the total time difference relative to the total optimal time.
[Table 3 Results for time cost ranges 0 ≤ PLJ ≤ 100, 0 ≤ LASV ≤ 200, 0 ≤ RJ ≤ 100. Columns: no. of branches; Not. opt. and Time diff. % for each of the three algorithms (Comparison, Adjustment, New Adjustment).]

[Fig. 3 Comparing the scheduling algorithms, for time cost ranges 0 ≤ PLJ ≤ 100, 0 ≤ LASV ≤ 200, 0 ≤ RJ ≤ 100, w.r.t. (a) the number of not-optimal cases and (b) the time difference percentages.]

[Table 4 Results for time cost ranges 0 ≤ PLJ ≤ 200, 0 ≤ LASV ≤ 200, 0 ≤ RJ ≤ 200. Columns: no. of branches; Not. opt. and Time diff. % for each of the three algorithms.]
[Fig. 4 Comparing the scheduling algorithms, for time cost ranges 0 ≤ PLJ ≤ 200, 0 ≤ LASV ≤ 200, 0 ≤ RJ ≤ 200, w.r.t. (a) the number of not-optimal cases and (b) the time difference percentages.]

[Table 5 Results for time cost ranges 0 ≤ PLJ ≤ 1000, 0 ≤ LASV ≤ 200, 0 ≤ RJ ≤ 1000. Columns: no. of branches; Not. opt. and Time diff. % for each of the three algorithms.]

[Fig. 5 Comparing the scheduling algorithms, for time cost ranges 0 ≤ PLJ ≤ 1000, 0 ≤ LASV ≤ 200, 0 ≤ RJ ≤ 1000, w.r.t. (a) the number of not-optimal cases and (b) the time difference percentages.]
[Table 6 Results for time cost ranges 0 ≤ PLJ ≤ 2000, 0 ≤ LASV ≤ 50, 0 ≤ RJ. Columns: no. of branches; Not. opt. and Time diff. % for each of the three algorithms.]

5 Conclusion

This paper summarizes the previous work in the area of scheduling algorithms, the Comparison algorithm and the Adjustment algorithm [2, 4], to order processes that are competing to access shared variables. We added a new adjustment phase to the adjustment algorithm and found the best order in which to apply the three developed phases. Simulation results show the merits of the comparison algorithm and the adjustment algorithm. They also show that the new approach adds further improvement to them. Although the new algorithm does not give the optimal solution in some cases, the error level is very minor. These results suggest including the developed algorithms in the design of parallel compilers.

6 References

[1] Abraham Silberschatz and Peter B. Galvin, Operating System Concepts, Fourth edition, Addison-Wesley Inc.
[2] R.A. Ammar, T.A. Fergany, and E.A. Maksoud, A fast algorithm to find the optimal accessing order of a critical section by parallel processes within a Fork-Join structure.
[3] T. Casavant and J. Kuhl, A taxonomy of scheduling in general-purpose distributed computing systems, IEEE Trans. Software Engineering, SE-14, 2, February.
[4] Ehab Yehia A. Maksoud, Optimal scheduling methods for competing processes within a parallel structure, Ph.D. dissertation, Faculty of Engineering, Cairo Univ.
[5] H. El-Rewini and T.G. Lewis, Scheduling parallel program tasks onto arbitrary target machines, J. Par. & Distr. Computing 9 (1990).
[6] Kai Hwang and Faye A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill.
[7] Mohamad R. Neilforoshan-Daradshti, On time cost optimization of parallel structure within shared memory environment, Ph.D. dissertation, Computer Science & Engineering Dept., Univ. of Connecticut.
[8] Mohamad R. Neilforoshan-Daradshti, R. Ammar, and T.A. Fergany, Optimizing the time cost of parallel structure by scheduling parallel processes to access the critical section, Proceedings of the Fourth International Conference on Computing and Information (ICCI 92), Toronto, Canada, May 28-30, 1992.
[9] B. Qin, H.A. Sholl, and R.A. Ammar, Micro time cost analysis of parallel computations, IEEE Trans. on Computers, vol. 40, no. 5, May 1991.
[10] V. Sarkar and J. Hennessy, Compile-time partitioning and scheduling of parallel programs, in Proc. of Symp. on Compiler Construction, 1986.
[11] P.L. Shaffer, Minimization of inter-processor synchronization in multiprocessors with shared and private memory, Proceedings of the International Conference on Parallel Processing, St. Charles, IL, August 8-12, 1989, vol. III.
A Novel Task Scheduling Algorithm for Heterogeneous Computing Vinay Kumar C. P.Katti P. C. Saxena SC&SS SC&SS SC&SS Jawaharlal Nehru University Jawaharlal Nehru University Jawaharlal Nehru University New
More informationMidterm Exam Amy Murphy 19 March 2003
University of Rochester Midterm Exam Amy Murphy 19 March 2003 Computer Systems (CSC2/456) Read before beginning: Please write clearly. Illegible answers cannot be graded. Be sure to identify all of your
More informationA Survey of Concurrency Control Algorithms in the Operating Systems
A Survey of Concurrency Control Algorithms in the Operating Systems Hossein Maghsoudloo Department of Computer Engineering, Shahr-e- Qods Branch, Islamic Azad University, Tehran, Iran Rahil Hosseini Department
More informationTHE LOGICAL STRUCTURE OF THE RC 4000 COMPUTER
THE LOGICAL STRUCTURE OF THE RC 4000 COMPUTER PER BRINCH HANSEN (1967) This paper describes the logical structure of the RC 4000, a 24-bit, binary computer designed for multiprogramming operation. The
More informationChapter 4: Threads. Operating System Concepts 9 th Edition
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples
More informationINTRODUCTION TO ALGORITHMS
UNIT- Introduction: Algorithm: The word algorithm came from the name of a Persian mathematician Abu Jafar Mohammed Ibn Musa Al Khowarizmi (ninth century) An algorithm is simply s set of rules used to perform
More informationNew algorithm for analyzing performance of neighborhood strategies in solving job shop scheduling problems
Journal of Scientific & Industrial Research ESWARAMURTHY: NEW ALGORITHM FOR ANALYZING PERFORMANCE OF NEIGHBORHOOD STRATEGIES 579 Vol. 67, August 2008, pp. 579-588 New algorithm for analyzing performance
More informationBuilding Efficient Concurrent Graph Object through Composition of List-based Set
Building Efficient Concurrent Graph Object through Composition of List-based Set Sathya Peri Muktikanta Sa Nandini Singhal Department of Computer Science & Engineering Indian Institute of Technology Hyderabad
More informationOptimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology
Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:
More informationLife, Death, and the Critical Transition: Finding Liveness Bugs in System Code
Life, Death, and the Critical Transition: Finding Liveness Bugs in System Code Charles Killian, James W. Anderson, Ranjit Jhala, and Amin Vahdat Presented by Nick Sumner 25 March 2008 Background We already
More informationA Partial Correctness Proof for Programs with Decided Specifications
Applied Mathematics & Information Sciences 1(2)(2007), 195-202 An International Journal c 2007 Dixie W Publishing Corporation, U. S. A. A Partial Correctness Proof for Programs with Decided Specifications
More informationChapter 4: Threads. Chapter 4: Threads
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples
More informationLecture Notes: Euclidean Traveling Salesman Problem
IOE 691: Approximation Algorithms Date: 2/6/2017, 2/8/2017 ecture Notes: Euclidean Traveling Salesman Problem Instructor: Viswanath Nagarajan Scribe: Miao Yu 1 Introduction In the Euclidean Traveling Salesman
More informationChapter 4: Threads. Operating System Concepts 9 th Edition
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples
More informationOperating Systems Unit 3
Unit 3 CPU Scheduling Algorithms Structure 3.1 Introduction Objectives 3.2 Basic Concepts of Scheduling. CPU-I/O Burst Cycle. CPU Scheduler. Preemptive/non preemptive scheduling. Dispatcher Scheduling
More informationChapter 13. The ISA of a simplified DLX Why use abstractions?
Chapter 13 The ISA of a simplified DLX In this chapter we describe a specification of a simple microprocessor called the simplified DLX. The specification is called an instruction set architecture (ISA).
More informationREDUCTION IN RUN TIME USING TRAP ANALYSIS
REDUCTION IN RUN TIME USING TRAP ANALYSIS 1 Prof. K.V.N.Sunitha 2 Dr V. Vijay Kumar 1 Professor & Head, CSE Dept, G.Narayanamma Inst.of Tech. & Science, Shaikpet, Hyderabad, India. 2 Dr V. Vijay Kumar
More informationOperating Systems. Lecture 09: Input/Output Management. Elvis C. Foster
Operating Systems 141 Lecture 09: Input/Output Management Despite all the considerations that have discussed so far, the work of an operating system can be summarized in two main activities input/output
More informationAdvanced Topics UNIT 2 PERFORMANCE EVALUATIONS
Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors
More informationPetri Nets ~------~ R-ES-O---N-A-N-C-E-I--se-p-te-m--be-r Applications.
Petri Nets 2. Applications Y Narahari Y Narahari is currently an Associate Professor of Computer Science and Automation at the Indian Institute of Science, Bangalore. His research interests are broadly
More informationA Fast Recursive Mapping Algorithm. Department of Computer and Information Science. New Jersey Institute of Technology.
A Fast Recursive Mapping Algorithm Song Chen and Mary M. Eshaghian Department of Computer and Information Science New Jersey Institute of Technology Newark, NJ 7 Abstract This paper presents a generic
More informationConcurrency Problems in Databases
Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Concurrency Problems in Databases Preeti Sharma Parul Tiwari Abstract
More informationINDIAN INSTITUTE OF TECHNOLOGY GUWAHATI
INDIAN INSTITUTE OF TECHNOLOGY GUWAHATI COMPUTER SCIENCE AND ENGINEERING Course: CS341 (Operating System), Model Solution Mid Semester Exam 1. [System Architecture, Structure, Service and Design: 5 Lectures
More informationGeneral Objectives: To understand the process management in operating system. Specific Objectives: At the end of the unit you should be able to:
F2007/Unit5/1 UNIT 5 OBJECTIVES General Objectives: To understand the process management in operating system Specific Objectives: At the end of the unit you should be able to: define program, process and
More informationMarco Danelutto. May 2011, Pisa
Marco Danelutto Dept. of Computer Science, University of Pisa, Italy May 2011, Pisa Contents 1 2 3 4 5 6 7 Parallel computing The problem Solve a problem using n w processing resources Obtaining a (close
More informationLecture Notes for Chapter 2: Getting Started
Instant download and all chapters Instructor's Manual Introduction To Algorithms 2nd Edition Thomas H. Cormen, Clara Lee, Erica Lin https://testbankdata.com/download/instructors-manual-introduction-algorithms-2ndedition-thomas-h-cormen-clara-lee-erica-lin/
More informationScheduling with Bus Access Optimization for Distributed Embedded Systems
472 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 5, OCTOBER 2000 Scheduling with Bus Access Optimization for Distributed Embedded Systems Petru Eles, Member, IEEE, Alex
More informationAC59/AT59/AC110/AT110 OPERATING SYSTEMS & SYSTEMS SOFTWARE DEC 2015
Q.2 a. Explain the following systems: (9) i. Batch processing systems ii. Time sharing systems iii. Real-time operating systems b. Draw the process state diagram. (3) c. What resources are used when a
More informationElementary maths for GMT. Algorithm analysis Part I
Elementary maths for GMT Algorithm analysis Part I Algorithms An algorithm is a step-by-step procedure for solving a problem in a finite amount of time Most algorithms transform input objects into output
More informationComprehensive Review of Data Prefetching Mechanisms
86 Sneha Chhabra, Raman Maini Comprehensive Review of Data Prefetching Mechanisms 1 Sneha Chhabra, 2 Raman Maini 1 University College of Engineering, Punjabi University, Patiala 2 Associate Professor,
More informationList Sort. A New Approach for Sorting List to Reduce Execution Time
List Sort A New Approach for Sorting List to Reduce Execution Time Adarsh Kumar Verma (Student) Department of Computer Science and Engineering Galgotias College of Engineering and Technology Greater Noida,
More information(Refer Slide Time: 1:27)
Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 1 Introduction to Data Structures and Algorithms Welcome to data
More informationData Flow Graph Partitioning Schemes
Data Flow Graph Partitioning Schemes Avanti Nadgir and Harshal Haridas Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802 Abstract: The
More informationSHARED MEMORY VS DISTRIBUTED MEMORY
OVERVIEW Important Processor Organizations 3 SHARED MEMORY VS DISTRIBUTED MEMORY Classical parallel algorithms were discussed using the shared memory paradigm. In shared memory parallel platform processors
More informationDecoupled Software Pipelining in LLVM
Decoupled Software Pipelining in LLVM 15-745 Final Project Fuyao Zhao, Mark Hahnenberg fuyaoz@cs.cmu.edu, mhahnenb@andrew.cmu.edu 1 Introduction 1.1 Problem Decoupled software pipelining [5] presents an
More informationMain Points of the Computer Organization and System Software Module
Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a
More informationMulti-Way Number Partitioning
Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09) Multi-Way Number Partitioning Richard E. Korf Computer Science Department University of California,
More informationA CSP Search Algorithm with Reduced Branching Factor
A CSP Search Algorithm with Reduced Branching Factor Igor Razgon and Amnon Meisels Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84-105, Israel {irazgon,am}@cs.bgu.ac.il
More informationNetwork Routing Protocol using Genetic Algorithms
International Journal of Electrical & Computer Sciences IJECS-IJENS Vol:0 No:02 40 Network Routing Protocol using Genetic Algorithms Gihan Nagib and Wahied G. Ali Abstract This paper aims to develop a
More informationDigital System Design Using Verilog. - Processing Unit Design
Digital System Design Using Verilog - Processing Unit Design 1.1 CPU BASICS A typical CPU has three major components: (1) Register set, (2) Arithmetic logic unit (ALU), and (3) Control unit (CU) The register
More informationIntroduction to Algorithms
Introduction to Algorithms An algorithm is any well-defined computational procedure that takes some value or set of values as input, and produces some value or set of values as output. 1 Why study algorithms?
More informationinfix expressions (review)
Outline infix, prefix, and postfix expressions queues queue interface queue applications queue implementation: array queue queue implementation: linked queue application of queues and stacks: data structure
More informationProcess Synchronization
CSC 4103 - Operating Systems Spring 2007 Lecture - VI Process Synchronization Tevfik Koşar Louisiana State University February 6 th, 2007 1 Roadmap Process Synchronization The Critical-Section Problem
More informationSearch Algorithms for Discrete Optimization Problems
Search Algorithms for Discrete Optimization Problems Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic
More informationTrees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.
Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial
More informationCHAPTER 3 DEVELOPMENT OF HEURISTICS AND ALGORITHMS
CHAPTER 3 DEVELOPMENT OF HEURISTICS AND ALGORITHMS 3.1 INTRODUCTION In this chapter, two new algorithms will be developed and presented, one using the Pascal s triangle method, and the other one to find
More informationLecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory
Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory 1 Example Programs Initially, A = B = 0 P1 P2 A = 1 B = 1 if (B == 0) if (A == 0) critical section
More informationThe Cheapest Way to Obtain Solution by Graph-Search Algorithms
Acta Polytechnica Hungarica Vol. 14, No. 6, 2017 The Cheapest Way to Obtain Solution by Graph-Search Algorithms Benedek Nagy Eastern Mediterranean University, Faculty of Arts and Sciences, Department Mathematics,
More informationMultithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa
CS4460 Advanced d Algorithms Batch 08, L4S2 Lecture 11 Multithreaded Algorithms Part 1 N. H. N. D. de Silva Dept. of Computer Science & Eng University of Moratuwa Announcements Last topic discussed is
More informationIntelligent, real-time scheduling for FMS
Intelligent, real-time scheduling for FMS V. Simeunovi}, S. Vrane{ Computer System Department, Institute Mihajlo Pupin, Volgina 15, 11060 Belgrade, Yugoslavia Email: vlada@lab200.imp.bg.ac.yu, sanja@lab200.imp.bg.ac.yu
More informationChapter 4: Multi-Threaded Programming
Chapter 4: Multi-Threaded Programming Chapter 4: Threads 4.1 Overview 4.2 Multicore Programming 4.3 Multithreading Models 4.4 Thread Libraries Pthreads Win32 Threads Java Threads 4.5 Implicit Threading
More informationOPERATING SYSTEMS. After A.S.Tanenbaum, Modern Operating Systems, 3rd edition. Uses content with permission from Assoc. Prof. Florin Fortis, PhD
OPERATING SYSTEMS #5 After A.S.Tanenbaum, Modern Operating Systems, 3rd edition Uses content with permission from Assoc. Prof. Florin Fortis, PhD General information GENERAL INFORMATION Cooperating processes
More informationStudy of Load Balancing Schemes over a Video on Demand System
Study of Load Balancing Schemes over a Video on Demand System Priyank Singhal Ashish Chhabria Nupur Bansal Nataasha Raul Research Scholar, Computer Department Abstract: Load balancing algorithms on Video
More informationProcess size is independent of the main memory present in the system.
Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.
More information3 No-Wait Job Shops with Variable Processing Times
3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select
More informationMassively Parallel Approximation Algorithms for the Traveling Salesman Problem
Massively Parallel Approximation Algorithms for the Traveling Salesman Problem Vaibhav Gandhi May 14, 2015 Abstract This paper introduces the reader to massively parallel approximation algorithms which
More informationIndexable and Strongly Indexable Graphs
Proceedings of the Pakistan Academy of Sciences 49 (2): 139-144 (2012) Copyright Pakistan Academy of Sciences ISSN: 0377-2969 Pakistan Academy of Sciences Original Article Indexable and Strongly Indexable
More informationProceedings of the 5th WSEAS International Conference on Telecommunications and Informatics, Istanbul, Turkey, May 27-29, 2006 (pp )
A Rapid Algorithm for Topology Construction from a Set of Line Segments SEBASTIAN KRIVOGRAD, MLADEN TRLEP, BORUT ŽALIK Faculty of Electrical Engineering and Computer Science University of Maribor Smetanova
More informationFrom Task Graphs to Petri Nets
From Task Graphs to Petri Nets Anthony Spiteri Staines Department of Computer Inf. Systems, Faculty of ICT, University of Malta Abstract This paper describes the similarities between task graphs and Petri
More informationParallel Auction Algorithm for Linear Assignment Problem
Parallel Auction Algorithm for Linear Assignment Problem Xin Jin 1 Introduction The (linear) assignment problem is one of classic combinatorial optimization problems, first appearing in the studies on
More informationIMPROVED A* ALGORITHM FOR QUERY OPTIMIZATION
IMPROVED A* ALGORITHM FOR QUERY OPTIMIZATION Amit Goyal Ashish Thakral G.K. Sharma Indian Institute of Information Technology and Management, Gwalior. Morena Link Road, Gwalior, India. E-mail: amitgoyal@iiitm.ac.in
More informationOperating Systems 2 nd semester 2016/2017. Chapter 4: Threads
Operating Systems 2 nd semester 2016/2017 Chapter 4: Threads Mohamed B. Abubaker Palestine Technical College Deir El-Balah Note: Adapted from the resources of textbox Operating System Concepts, 9 th edition
More informationARITHMETIC operations based on residue number systems
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 53, NO. 2, FEBRUARY 2006 133 Improved Memoryless RNS Forward Converter Based on the Periodicity of Residues A. B. Premkumar, Senior Member,
More information