
Dynamic Scheduling in an Implicit Parallel System

Haruyasu Ueda, Institute of Social Information Science, Fujitsu Laboratories Ltd., Makuhari, Chiba 273, Japan
Johan Montelius, Swedish Institute of Computer Science, Kista, Sweden

Abstract

Penny is a system that exploits fine-grained parallelism in an AKL program. During execution, a set of workers (processes) share a set of tasks that are created dynamically. To achieve good parallel speedup, a worker must be assigned a new task as soon as it becomes idle. Since no compiler support nor user annotations are available to guide the execution, the Penny system needs a very efficient dynamic scheduler. We have developed a configurable scheduler in order to experiment with different approaches. We evaluated and analyzed the approaches by running a small set of benchmarks, for which statistics and performance are reported.

Keywords: dynamic scheduling, implicit parallelism, concurrent logic language, concurrent constraint, auto-scheduler

1 Introduction

The use of parallel computers has so far been restricted to systems written in programming languages that make explicit use of the machine resources. Therefore large regular problems, which are easily divided into independent parts, are the main applications of parallel computers. However, as parallel computers increase in popularity, parallelizing even smaller irregular problems is becoming more interesting.

Parallelizing irregular problems by hand is very difficult. The first obstacle is to code the problem in a way that allows parallel execution. This is probably the hardest part, since it often requires both redesigning the algorithms and handling the hazards of shared data and locks that can (and will) create both erroneous results and deadlocks. The second problem is to divide the program into a suitable number of large parts and to assign these to processors. If the problem is irregular, this task can be very difficult.
The work is made easier by using a concurrent language, such as AKL [1], that allows the programmer to implement communicating processes without having to use locks or barriers for synchronization. The Penny system [2] can then exploit the parallelism implicitly available in the program, i.e. there is no need for user annotations. This places high demands on the implementation of the Penny system: it must be able to handle the scheduling of tasks very efficiently.

2 The Penny System

The Penny system is a parallel implementation of the concurrent constraint language AKL [1] on a shared-memory parallel computer. The distinguishing features of AKL are deep guards and encapsulated search. Both so-called AND- and OR-parallelism are exploited by the Penny system. What follows is a very simple model of the Penny system, sufficient for understanding the scheduler [1, 2].

In an execution of Penny, a set of workers is first spawned by the system. A worker is implemented as a thread in the operating system and is dynamically scheduled to a processor by the operating system. A worker should always be assigned to a processor, so there is no need to create more workers than the number of available processors. All workers are equal, i.e., there is no master-slave relationship.

Each worker will execute a given task, and while doing so it will possibly generate new tasks. When a task is completed, a new task must be found and assigned to the worker. The problem is how a new task should be found by the scheduler. We have divided the scheduler into two parts: a local scheduler and a global scheduler. The local scheduler takes care of the scheduling of tasks generated by a worker itself. As long as a worker has enough tasks, it will only call the local scheduler. When a worker runs out of tasks, it will call the global scheduler, which will try to locate a new task either in a global pool of tasks or by taking a task from another worker. The local scheduler must exist also in a sequential implementation, since the tasks also reflect the concurrent nature of the system. The global scheduler, however, is only needed in a parallel implementation.

2.1 The Execution State

The execution state can be divided into a global structure that is shared among all workers and local structures that are owned by individual workers. The local structures can be accessed by the other workers, but they are controlled by the owner. The global structure is a representation of all available processes. Each process is either new, running or suspended.

The local structures of a worker consist of a pointer to at most one running process, a continuation task stack, and a wake task stack. A busy worker that is executing a running process can generate new processes. A pointer to such a process is a continuation task. The worker can also generate data that will allow a suspended process to continue its work. A pointer to such a process is a wake task. Each task is pushed on the corresponding task stack owned by the worker.

2.2 The Local Scheduler

A worker will continue to execute the running process until it either terminates or suspends. There is no preemption of processes in the system. When a worker has finished the execution of a process, it must select a new process: either the last one created or the last one woken. Some systems have chosen a strategy called eager wake, where a worker stops the execution of a running process as soon as a wake task is generated. This approach has some benefits, but it causes very frequent context switches at a high cost. Besides the cost, it should be avoided in an implicit parallel system, since it is not under the control of the programmer. We have therefore decided to prefer the last created process over the last woken one. This paper will not discuss the strategies of the local scheduler further, since this would require a deeper understanding of the execution mechanism and semantics of the AKL language.

2.3 The Global Scheduler

In the global scheduler, both types of tasks can be moved from a busy worker to an idle worker.
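The worker-local structures of Section 2.1 and the local scheduling policy of Section 2.2 can be sketched as follows. This is a minimal illustration, not Penny's actual code: the type and function names are invented for this sketch, and only the policy described above (prefer the most recently created process, fall back to the most recently woken one, no preemption) is taken from the text.

```c
#include <stddef.h>

/* A task is a pointer to a process: either a newly created process
 * (continuation task) or a suspended process that has been woken
 * (wake task).  The process representation itself is irrelevant here. */
typedef struct process Process;

#define STACK_MAX 1024

typedef struct {
    Process *tasks[STACK_MAX];
    size_t top;              /* number of tasks on the stack */
} TaskStack;

/* Worker-local state: at most one running process plus two task stacks. */
typedef struct {
    Process *running;        /* the process currently being executed */
    TaskStack continuations; /* pointers to newly created processes  */
    TaskStack wake;          /* pointers to processes ready to resume */
} Worker;

static void push(TaskStack *s, Process *p) {
    if (s->top < STACK_MAX)
        s->tasks[s->top++] = p;
}

static Process *pop(TaskStack *s) {
    return s->top > 0 ? s->tasks[--s->top] : NULL;
}

/* Local scheduler: there is no preemption, so this runs only when the
 * current process terminates or suspends.  The last created process is
 * preferred over the last woken one (no eager wake). */
Process *local_schedule(Worker *w) {
    Process *next = pop(&w->continuations);
    if (next == NULL)
        next = pop(&w->wake);  /* fall back to woken processes */
    w->running = next;
    return next;
}
```

When `local_schedule` returns `NULL`, the worker has run out of local tasks; this is the point at which it would call the global scheduler.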
There is no difference between these two types of tasks in terms of switching overhead. The global scheduler does, however, try to prioritize the distribution of wake tasks, since these are more expensive for a busy worker to handle.

The main difficulty with the global scheduler is that the performance of a scheduler will change drastically depending on the executed program. Different programs can have very different mixes of parallel and sequential parts. A scheduler that excels for one type of program can perform poorly on another. The goal is not to find a scheduler that outperforms other schedulers on certain programs, but to find a scheduler that has a predictable behavior for all types of programs.

There is no methodology for designing a scheduler without having a real system to experiment with. It is very hard to predict performance on a cache-based shared-memory multiprocessor, because memory usage and cache hit-ratio are often the limiting factors. The Penny system was therefore developed with a configurable global scheduler, to enable experiments with different approaches. The system is designed so that different scheduler approaches can be tested without changing the basic execution mechanism.

We have experimented with four different approaches: two using a global pool of tasks and two working directly with the local stacks. An advantage of using a global task pool is that it makes load balancing easier. The disadvantage is that it complicates the implementation, and in some cases the overhead is greater than the effect of a better balancing of the load. The schedulers also differ in who is responsible for the distribution of tasks. This responsibility can be placed on idle or busy workers. If the responsibility is placed on busy workers, it is questionable how much overhead a busy worker is allowed to pay in order to keep idle workers from waiting.

The four schedulers are as follows. In the first, driven by busy workers with a global task pool, the busy workers periodically check whether the global pool is empty and whether there are any idle workers.
If so, some of the busy worker's tasks are moved to the global pool, and an idle worker will then collect them from there. The second scheduler is driven by busy workers without the use of a global task pool. Busy workers will check whether there are any idle workers each time a new task is created; if so, an idle worker will be given the new task directly. The third scheduler is driven by idle workers without the use of a global task pool. An idle worker will look for a task directly in the stacks of the busy workers; when a task is found, the idle worker will steal it. The fourth scheduler is similar to the third, but there is only one "thief" active at any given moment.

This thief will, however, steal as many tasks as possible, take some for itself, and place the rest in the global pool. The other idle workers will simply wait for tasks to be placed in the global pool.

2.4 Implementation

There is no central scheduler process. Instead, each worker performs the necessary operations to distribute tasks. Since the workers access some shared data structures, locks have to be used. The locks are implemented as spin locks using atomic-swap instructions. The task pool is implemented as a FIFO queue and is protected by a lock. The lock overhead is very small and few collisions occur.

In the two schedulers driven by idle workers, an idle worker will access the local stacks of a busy worker directly. Since the task stacks are accessed both by the owner worker and by idle workers, the stacks are protected by a lock. In the schedulers driven by the busy workers, no lock is necessary for the local stacks. The lock could cause a large overhead, but it turns out that as long as a worker is left alone, the overhead is very small.

In the pool-based scheduler driven by busy workers, a busy worker must detect whether there are any idle workers. This test is integrated with the garbage-collection test that has to be done anyway: by setting the garbage-collection flag, an idle worker will stop the busy workers. The busy workers must then determine whether actual garbage collection was necessary or an idle worker was requesting tasks. This scheme induces very low overheads as long as all workers are busy. In the scheduler that hands tasks directly to idle workers, a worker must determine whether there is an idle worker each time a task is created. This is much more frequent than the test for garbage collection, and it does induce a noticeable overhead.

3 Evaluation

We used a Sun SparcCenter with 8 processors running the Solaris 2.4 operating system. Each experiment was done while no one else used the machine. Up to fourteen workers were allocated, in spite of the fact that only eight processors were available.
This is done in order to simulate what will happen on a loaded system. A scheduler that performs well on a lightly loaded machine can perform very badly on a loaded machine. For each scheduler and for each benchmark, we measured the total execution time for different numbers of workers. Each experiment was run forty times, and the shortest time was taken for the evaluation.

In addition to the execution time, statistics were gathered from an execution log. A specially compiled version of the system logs the time each worker becomes idle and the time it resumes execution. This information is then used to compute the total busy time, the total idle time, and the number of global scheduling operations, from which the average time for and between scheduling operations can be estimated. All times are reported in milliseconds.

3.1 The Game of Life

The benchmark is an implementation of the "game of life" where each cell is implemented as an AKL process. Each process has to communicate with all of its neighbors to determine its next state. This creates an abundance of tasks that can be executed in parallel.

[Figure 1: Execution time of the Life benchmark for each scheduler, against the number of workers.]
[Table 1: Statistics of the Life benchmark with 7 workers: execution time, total busy and idle time, average busy and idle time, and number of scheduling operations.]

Figure 1 shows the execution time with each scheduler according to the number of workers. Table 1 shows the statistics gathered from an experiment using seven workers. The execution time reported in Table 1 differs significantly from the one in Figure 1, since the overhead of generating the log file for this benchmark cannot be ignored.
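The bookkeeping behind these statistics — total busy and idle time derived from a per-worker log of idle/resume events — can be sketched as follows. The event encoding and names are assumptions for illustration only; the text specifies just that the times a worker becomes idle and resumes are logged, and that busy/idle totals and scheduling-operation counts are derived from them.

```c
#include <stddef.h>

/* One log record per state change: the elapsed time (ms) at which a
 * worker became idle or resumed execution.  A worker starts out busy
 * at t = 0, so a well-formed log alternates idle/resume. */
typedef enum { BECAME_IDLE, RESUMED } EventKind;

typedef struct {
    double    time;   /* elapsed time in milliseconds */
    EventKind kind;
} Event;

typedef struct {
    double total_busy;  /* ms spent executing tasks        */
    double total_idle;  /* ms spent waiting for a new task */
    int    n_sched;     /* number of global scheduling operations */
} Stats;

/* Accumulate busy/idle totals for one worker from its time-ordered
 * log; `end` is the total elapsed time of the run.  Each completed
 * idle/resume pair counts as one global scheduling operation. */
Stats worker_stats(const Event *log, size_t n, double end) {
    Stats s = {0.0, 0.0, 0};
    double t = 0.0;   /* time of the previous state change */
    int busy = 1;     /* a worker starts out busy          */
    for (size_t i = 0; i < n; i++) {
        if (busy && log[i].kind == BECAME_IDLE) {
            s.total_busy += log[i].time - t;
            t = log[i].time;
            busy = 0;
        } else if (!busy && log[i].kind == RESUMED) {
            s.total_idle += log[i].time - t;
            t = log[i].time;
            busy = 1;
            s.n_sched++;  /* the worker found a new task */
        }
    }
    if (busy) s.total_busy += end - t;
    else      s.total_idle += end - t;
    return s;
}
```

The averages reported in the tables would then simply be these totals divided by the number of scheduling operations.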

Up to eight workers, the differences between the schedulers are not very large. There is an initial overhead paid by two of the schedulers when one worker is used, but this is regained when more workers are used. One scheduler does not perform as well as the others when between four and eight workers are used. One reason for this is found in Table 1: that scheduler performs more than three times as many scheduling operations as the others. This can be explained by the fact that the two pool-based schedulers can use the global pool to move several tasks in a single scheduling operation.

The most interesting observation is the dramatic decrease in performance of the voluntary scheduler when more than eight workers are used. In this scheduler, the idle workers must wait for a busy worker to distribute tasks. If a busy worker is not scheduled by the operating system, it cannot distribute its tasks, and the idle workers must spend their time-slot waiting. The other busy-driven scheduler does not suffer as dramatically from this effect, probably because a busy worker will detect idle workers more quickly.

3.2 Towers of Hanoi

The benchmarks are two implementations of the "Towers of Hanoi" puzzle. The first one only generates the list of all plate movements of a solution. As can be seen in Figure 2, this benchmark does not show any differences between the schedulers, apart from a small overhead for one of them.

[Figure 2: Execution time of the Hanoi benchmark, solution only.]
[Figure 3: Execution time of the Hanoi benchmark, with counting.]
[Figure 4: Number of busy workers for the Hanoi benchmark (3 workers), against elapsed time in ms.]

In the other benchmark, a procedure that traverses the list and counts the number of moves was added. This procedure can run in parallel with the first part, since the list of solutions is produced incrementally, but it can in itself not be parallelized.
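The effect of such an unparallelizable stage can be made concrete with a simple lower bound: with p workers, a program containing `total` ms of work of which `seq` ms must run sequentially can never finish faster than max(total/p, seq). This is a generic illustration, not part of the paper; the only figure taken from the text is the 900 ms list-traversal time, and the 2700 ms total used below is hypothetical.

```c
/* Lower bound (ms) on execution time with p workers for a program
 * with `total` ms of work of which `seq` ms cannot be parallelized. */
double exec_time_bound(double total, double seq, int p) {
    double parallel = total / p;          /* perfect division of work */
    return parallel > seq ? parallel : seq;
}
```

With a hypothetical 2700 ms of total work and the 900 ms sequential traversal, three workers already hit the 900 ms floor, and adding more workers cannot help.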
The counting of the list sets an upper bound on the obtainable speedup: no strategy can execute the program in less than 900 ms, which is the time needed to traverse the list. Two of the schedulers now perform significantly better than the others when four workers are used. There is also a difference when more than eight workers are used, but then in favor of another scheduler.

Figure 4 shows the number of busy workers in each 50 ms interval during an execution using three workers. As can be seen, two of the schedulers terminate the parallel part much earlier but then spend more time completing the execution, while the others have scheduled the counting procedure at an early phase and traversed the list at almost the same speed as it was produced. To achieve good parallel performance for this benchmark, the counting procedure would have to be scheduled with a higher priority, but there is no way for the programmer to annotate an AKL program. This is the drawback of a completely implicit parallel system.

3.3 Matrix Multiplication

[Figure 5: Execution time of the Matrix benchmark.]

The benchmark is a multiplication of a matrix and a vector. A worker will start working on the first row of the matrix and create a continuation task for the remaining rows. This task has to be assigned to another worker, who in turn will, after starting on the second row, create a new continuation task, and so on. There is thus at most one task available at any time, and this creates some strange behavior. Figure 5 shows that two of the schedulers have very good parallel performance; they are not disturbed even when more than eight workers are used. The other two schedulers perform very poorly. The statistics gathered from the executions do not fully explain why, but it is clear that the single available task does not propagate quickly enough; both of these schedulers perform very few scheduling operations.

3.4 Smith-Waterman

[Figure 6: Execution time of the Smith-Waterman benchmark: Penny compared to gcc at different optimization levels.]

The Smith-Waterman algorithm is used when DNA sequences are compared. A typical application is to find the sequence in a database that best matches a given sequence. This is an obviously parallel application, since all comparisons are independent. It is more challenging to parallelize the Smith-Waterman algorithm itself. In the Penny system, this is done automatically, without any changes to the original AKL implementation of the algorithm.
In order to be competitive with a C program, an extra builtin was added to the Penny system. The builtin performs the most primitive arithmetic operation in the algorithm and increases the overall performance by a factor of three. In the development of the Penny system, very little effort has been spent on compiling arithmetic operations.

Two sequences with 600 elements each were compared. The experiments were executed on a SparcCenter with twenty processors, and the minimum execution time over 100 runs is reported. The C program was allowed to run the computation ten times, to avoid the initial cost of filling the caches and initializing memory blocks; this cost is not included in the reported Penny times either. Figure 6 shows the performance of the Penny system compared to an optimized C program compiled with gcc at different optimization levels. The Penny system with two processors outperforms the plain gcc-compiled program. With three processors, it runs almost as fast as the C program compiled with the -O4 option.

The results are very encouraging. Although a high-level language such as AKL will have a hard time competing with a C compiler, they show that an implicit parallel system can match the sequential C program with as little as two processors.

4 Related Work

4.1 KLIC

One of the best concurrent logic programming systems is the KLIC compiler [4]. It compiles KL1 programs into C and produces very fast code. The KL1 language is almost a complete subset of the AKL language, and the benchmarks presented in this paper use almost the same program constructs as KL1. For these benchmarks, the main difference between KLIC and Penny is that KLIC is an explicit parallel system: the programmer has to annotate the code to make it run in parallel, and the gained performance depends on the skill of the programmer.

To compare the two systems, a "life" benchmark of KLIC was selected. The KL1 version is a reduced game of life, where each cell has only four neighbors. The grid is divided into clusters, which are then distributed over the available processors. We ran the experiment with a 30x40 grid divided into twelve clusters of 10x10 cells each. The division allows an even distribution of clusters over the available processors. The timings are the best of ten consecutive runs, thus avoiding the extra time it takes to boot the system. In the Penny version, no annotations are necessary to parallelize the program. The program is also simplified, since there is no need to divide it into clusters; all cells are treated equally.

[Table 2: The game of life, times in milliseconds, per number of processors: KLIC, Penny, and the ratio Penny/KLIC.]

The figures in Table 2 show that, for this benchmark, the Penny system is only half as slow as the KLIC system, while the normal factor between the KLIC system and Penny is around four to six (sometimes up to ten).
The main reasons why the life benchmark shows such good relative performance are the following. First, although the KLIC system is much faster at decoding instructions, since it does not have the overhead of an emulator, a large part of the execution time is spent in other parts of the system; instruction decoding is not that important for this benchmark. Second, in the KLIC system, binding shared variables (shared between nodes) is considerably more expensive than binding non-shared variables, and in the life benchmark about 10% of the communication (through variables) is performed with shared variables. In the Penny system, all variables are potentially shared, and much effort has been spent on making the binding operation as efficient as possible.

4.2 Auto-scheduler

The combination of a global task pool and local task stacks is similar to the idea of the distributed task queue in an auto-scheduling [3] environment. In auto-scheduling, the compiler and explicit annotations in the program help determine the destination of a task. In the Penny system, all tasks are first pushed on the local stack and moved by the global scheduler only when needed; no explicit indication of the destination processor is needed in the program. Since the scheduler is configurable depending on the program, instead of fixed in the system, it can be more efficient and flexible.

5 Summary

In the development of the schedulers, the statistics gathered from the system during executions have been very important. They have explained many, but not all, strange behaviors. Only measuring the execution time or the number of scheduling events is not enough; one needs a trace of the execution in order to understand the behavior. Gathering the statistics must be performed with a minimum of interference, since the execution will otherwise behave differently.

The best all-round scheduler did not always outperform the other schedulers, but it had a predictable behaviour.
The problems with the schedulers do not show up when highly parallel benchmarks are executed. They emerge when there is little parallelism, when the obtained speedup depends on the distribution of a single task, or when the machine has a high load. A scheduler that performs well on an unloaded machine can break down when executed on a machine with high load. This is often neglected, since it is more convenient to run benchmarks on an unloaded system.

Acknowledgements

We thank the people at SICS for discussions, especially Prof. Seif Haridi and Dr. Sverker Janson.

References

[1] Sverker Janson and Seif Haridi, "Programming Paradigms of the Andorra Kernel Language," in Logic Programming: Proc. of the 1991 Int'l Logic Programming Symposium, MIT Press.

[2] Johan Montelius and Khayri Ali, "An and/or-parallel implementation of AKL," New Generation Computing, 13(4).

[3] Jose E. Moreira and Constantine D. Polychronopoulos, "Autoscheduling in a Distributed Shared-Memory Environment," in Languages and Compilers for Parallel Computing, 7th Int'l Workshop Proc., LNCS 892, Springer-Verlag.

[4] KLIC. klic-requests@icot.or.jp, ICOT.


More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems Chapter 13: I/O Systems Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations STREAMS Performance Silberschatz, Galvin and

More information

Chapter 12: I/O Systems. Operating System Concepts Essentials 8 th Edition

Chapter 12: I/O Systems. Operating System Concepts Essentials 8 th Edition Chapter 12: I/O Systems Silberschatz, Galvin and Gagne 2011 Chapter 12: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations STREAMS

More information

task object task queue

task object task queue Optimizations for Parallel Computing Using Data Access Information Martin C. Rinard Department of Computer Science University of California, Santa Barbara Santa Barbara, California 9316 martin@cs.ucsb.edu

More information

Processes and Threads. Processes: Review

Processes and Threads. Processes: Review Processes and Threads Processes and their scheduling Threads and scheduling Multiprocessor scheduling Distributed Scheduling/migration Lecture 3, page 1 Processes: Review Multiprogramming versus multiprocessing

More information

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition Chapter 6: CPU Scheduling Silberschatz, Galvin and Gagne 2013 Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Thread Scheduling Multiple-Processor Scheduling Real-Time

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Week 05 Lecture 18 CPU Scheduling Hello. In this lecture, we

More information

Threads. Raju Pandey Department of Computer Sciences University of California, Davis Spring 2011

Threads. Raju Pandey Department of Computer Sciences University of California, Davis Spring 2011 Threads Raju Pandey Department of Computer Sciences University of California, Davis Spring 2011 Threads Effectiveness of parallel computing depends on the performance of the primitives used to express

More information

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli Eect of fan-out on the Performance of a Single-message cancellation scheme Atul Prakash (Contact Author) Gwo-baw Wu Seema Jetli Department of Electrical Engineering and Computer Science University of Michigan,

More information

Following are a few basic questions that cover the essentials of OS:

Following are a few basic questions that cover the essentials of OS: Operating Systems Following are a few basic questions that cover the essentials of OS: 1. Explain the concept of Reentrancy. It is a useful, memory-saving technique for multiprogrammed timesharing systems.

More information

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL Jun Sun, Yasushi Shinjo and Kozo Itano Institute of Information Sciences and Electronics University of Tsukuba Tsukuba,

More information

Hazard Pointers. Number of threads unbounded time to check hazard pointers also unbounded! difficult dynamic bookkeeping! thread B - hp1 - hp2

Hazard Pointers. Number of threads unbounded time to check hazard pointers also unbounded! difficult dynamic bookkeeping! thread B - hp1 - hp2 Hazard Pointers Store pointers of memory references about to be accessed by a thread Memory allocation checks all hazard pointers to avoid the ABA problem thread A - hp1 - hp2 thread B - hp1 - hp2 thread

More information

Near Memory Key/Value Lookup Acceleration MemSys 2017

Near Memory Key/Value Lookup Acceleration MemSys 2017 Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy

More information

Request Network Reply Network CPU L1 Cache L2 Cache STU Directory Memory L1 cache size unlimited L1 write buer 8 lines L2 cache size unlimited L2 outs

Request Network Reply Network CPU L1 Cache L2 Cache STU Directory Memory L1 cache size unlimited L1 write buer 8 lines L2 cache size unlimited L2 outs Evaluation of Communication Mechanisms in Invalidate-based Shared Memory Multiprocessors Gregory T. Byrd and Michael J. Flynn Computer Systems Laboratory Stanford University, Stanford, CA Abstract. Producer-initiated

More information

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems Processes CS 475, Spring 2018 Concurrent & Distributed Systems Review: Abstractions 2 Review: Concurrency & Parallelism 4 different things: T1 T2 T3 T4 Concurrency: (1 processor) Time T1 T2 T3 T4 T1 T1

More information

CPU Scheduling. The scheduling problem: When do we make decision? - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s)

CPU Scheduling. The scheduling problem: When do we make decision? - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s) 1/32 CPU Scheduling The scheduling problem: - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s) When do we make decision? 2/32 CPU Scheduling Scheduling decisions may take

More information

CPU Scheduling. CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections )

CPU Scheduling. CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections ) CPU Scheduling CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections 6.7.2 6.8) 1 Contents Why Scheduling? Basic Concepts of Scheduling Scheduling Criteria A Basic Scheduling

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Objectives of Chapter To provide a grand tour of the major computer system components:

More information

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) I/O Systems Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) I/O Systems 1393/9/15 1 / 57 Motivation Amir H. Payberah (Tehran

More information

Chapter 5 Concurrency: Mutual Exclusion and Synchronization

Chapter 5 Concurrency: Mutual Exclusion and Synchronization Operating Systems: Internals and Design Principles Chapter 5 Concurrency: Mutual Exclusion and Synchronization Seventh Edition By William Stallings Designing correct routines for controlling concurrent

More information

1 Introduction. 2 total-store-order. Take me for a spin. 2.1 Peterson's algorithm. Johan Montelius HT2016

1 Introduction. 2 total-store-order. Take me for a spin. 2.1 Peterson's algorithm. Johan Montelius HT2016 Take me for a spin Johan Montelius HT2016 1 Introduction We will rst experience that we can not implement any synchronization primitives using regular read and write operations. Then we will implement

More information

q ii (t) =;X q ij (t) where p ij (t 1 t 2 ) is the probability thatwhen the model is in the state i in the moment t 1 the transition occurs to the sta

q ii (t) =;X q ij (t) where p ij (t 1 t 2 ) is the probability thatwhen the model is in the state i in the moment t 1 the transition occurs to the sta DISTRIBUTED GENERATION OF MARKOV CHAINS INFINITESIMAL GENERATORS WITH THE USE OF THE LOW LEVEL NETWORK INTERFACE BYLINA Jaros law, (PL), BYLINA Beata, (PL) Abstract. In this paper a distributed algorithm

More information

Job Re-Packing for Enhancing the Performance of Gang Scheduling

Job Re-Packing for Enhancing the Performance of Gang Scheduling Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT

More information

Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool

Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool Håkan Sundell School of Business and Informatics University of Borås, 50 90 Borås E-mail: Hakan.Sundell@hb.se Philippas

More information

residual residual program final result

residual residual program final result C-Mix: Making Easily Maintainable C-Programs run FAST The C-Mix Group, DIKU, University of Copenhagen Abstract C-Mix is a tool based on state-of-the-art technology that solves the dilemma of whether to

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 What is an Operating System? What is

More information

Chapter 3: Process Concept

Chapter 3: Process Concept Chapter 3: Process Concept Chapter 3: Process Concept Process Concept Process Scheduling Operations on Processes Inter-Process Communication (IPC) Communication in Client-Server Systems Objectives 3.2

More information

Chapter 3: Process Concept

Chapter 3: Process Concept Chapter 3: Process Concept Chapter 3: Process Concept Process Concept Process Scheduling Operations on Processes Inter-Process Communication (IPC) Communication in Client-Server Systems Objectives 3.2

More information

The former pager tasks have been replaced in 7.9 by the special savepoint tasks.

The former pager tasks have been replaced in 7.9 by the special savepoint tasks. 1 2 3 4 With version 7.7 the I/O interface to the operating system has been reimplemented. As of version 7.7 different parameters than in version 7.6 are used. The improved I/O system has the following

More information

!! How is a thread different from a process? !! Why are threads useful? !! How can POSIX threads be useful?

!! How is a thread different from a process? !! Why are threads useful? !! How can POSIX threads be useful? Chapter 2: Threads: Questions CSCI [4 6]730 Operating Systems Threads!! How is a thread different from a process?!! Why are threads useful?!! How can OSIX threads be useful?!! What are user-level and kernel-level

More information

Multiprocessor Support

Multiprocessor Support CSC 256/456: Operating Systems Multiprocessor Support John Criswell University of Rochester 1 Outline Multiprocessor hardware Types of multi-processor workloads Operating system issues Where to run the

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Operating Systems: Internals and Design Principles You re gonna need a bigger boat. Steven

More information

instruction fetch memory interface signal unit priority manager instruction decode stack register sets address PC2 PC3 PC4 instructions extern signals

instruction fetch memory interface signal unit priority manager instruction decode stack register sets address PC2 PC3 PC4 instructions extern signals Performance Evaluations of a Multithreaded Java Microcontroller J. Kreuzinger, M. Pfeer A. Schulz, Th. Ungerer Institute for Computer Design and Fault Tolerance University of Karlsruhe, Germany U. Brinkschulte,

More information

Chapter 3: Process Concept

Chapter 3: Process Concept Chapter 3: Process Concept Silberschatz, Galvin and Gagne 2013! Chapter 3: Process Concept Process Concept" Process Scheduling" Operations on Processes" Inter-Process Communication (IPC)" Communication

More information

Lecture 16: Recapitulations. Lecture 16: Recapitulations p. 1

Lecture 16: Recapitulations. Lecture 16: Recapitulations p. 1 Lecture 16: Recapitulations Lecture 16: Recapitulations p. 1 Parallel computing and programming in general Parallel computing a form of parallel processing by utilizing multiple computing units concurrently

More information

Chapter 3: Processes. Operating System Concepts 8 th Edition,

Chapter 3: Processes. Operating System Concepts 8 th Edition, Chapter 3: Processes, Silberschatz, Galvin and Gagne 2009 Chapter 3: Processes Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Silberschatz, Galvin and Gagne 2009

More information

Performance Modeling of a Parallel I/O System: An. Application Driven Approach y. Abstract

Performance Modeling of a Parallel I/O System: An. Application Driven Approach y. Abstract Performance Modeling of a Parallel I/O System: An Application Driven Approach y Evgenia Smirni Christopher L. Elford Daniel A. Reed Andrew A. Chien Abstract The broadening disparity between the performance

More information

QUESTION BANK UNIT I

QUESTION BANK UNIT I QUESTION BANK Subject Name: Operating Systems UNIT I 1) Differentiate between tightly coupled systems and loosely coupled systems. 2) Define OS 3) What are the differences between Batch OS and Multiprogramming?

More information

CS 326: Operating Systems. CPU Scheduling. Lecture 6

CS 326: Operating Systems. CPU Scheduling. Lecture 6 CS 326: Operating Systems CPU Scheduling Lecture 6 Today s Schedule Agenda? Context Switches and Interrupts Basic Scheduling Algorithms Scheduling with I/O Symmetric multiprocessing 2/7/18 CS 326: Operating

More information

Chapter 8: Virtual Memory. Operating System Concepts

Chapter 8: Virtual Memory. Operating System Concepts Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

CSL373: Lecture 6 CPU Scheduling

CSL373: Lecture 6 CPU Scheduling CSL373: Lecture 6 CPU Scheduling First come first served (FCFS or FIFO) Simplest scheduling algorithm cpu cpu 0 0 Run jobs in order that they arrive Disadvantage: wait time depends on arrival order. Unfair

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

Operating Systems Lecture #9: Concurrent Processes

Operating Systems Lecture #9: Concurrent Processes : Written by based on the lecture series of Dr. Dayou Li and the book Understanding 4th ed. by I.M.Flynn and A.McIver McHoes (2006) Department of Computer Science and Technology,., 2013 15th April 2013

More information

Virtual Memory Outline

Virtual Memory Outline Virtual Memory Outline Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations Operating-System Examples

More information

Last Class: Demand Paged Virtual Memory

Last Class: Demand Paged Virtual Memory Last Class: Demand Paged Virtual Memory Benefits of demand paging: Virtual address space can be larger than physical address space. Processes can run without being fully loaded into memory. Processes start

More information

CSE 410 Final Exam 6/09/09. Suppose we have a memory and a direct-mapped cache with the following characteristics.

CSE 410 Final Exam 6/09/09. Suppose we have a memory and a direct-mapped cache with the following characteristics. Question 1. (10 points) (Caches) Suppose we have a memory and a direct-mapped cache with the following characteristics. Memory is byte addressable Memory addresses are 16 bits (i.e., the total memory size

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Shared-memory Parallel Programming with Cilk Plus

Shared-memory Parallel Programming with Cilk Plus Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 30 August 2018 Outline for Today Threaded programming

More information

minute xed time-out. In other words, the simulations indicate that battery life is extended by more than 17% when the share algorithm is used instead

minute xed time-out. In other words, the simulations indicate that battery life is extended by more than 17% when the share algorithm is used instead A Dynamic Disk Spin-down Technique for Mobile Computing David P. Helmbold, Darrell D. E. Long and Bruce Sherrod y Department of Computer Science University of California, Santa Cruz Abstract We address

More information

An Ecient Scheduling Algorithm for Multiprogramming on Parallel Computing Systems

An Ecient Scheduling Algorithm for Multiprogramming on Parallel Computing Systems An Ecient Scheduling Algorithm for Multiprogramming on Parallel Computing Systems Zhou B. B., Brent R. P. and Qu X. Computer Sciences Laboratory The Australian National University Canberra, ACT 0200, Australia

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University Frequently asked questions from the previous class survey CS 370: SYSTEM ARCHITECTURE & SOFTWARE [CPU SCHEDULING] Shrideep Pallickara Computer Science Colorado State University OpenMP compiler directives

More information

Chapter 3: Processes

Chapter 3: Processes Chapter 3: Processes Silberschatz, Galvin and Gagne 2013 Chapter 3: Processes Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Silberschatz, Galvin and Gagne 2013

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

CHAPTER 2: PROCESS MANAGEMENT

CHAPTER 2: PROCESS MANAGEMENT 1 CHAPTER 2: PROCESS MANAGEMENT Slides by: Ms. Shree Jaswal TOPICS TO BE COVERED Process description: Process, Process States, Process Control Block (PCB), Threads, Thread management. Process Scheduling:

More information

Relative Reduced Hops

Relative Reduced Hops GreedyDual-Size: A Cost-Aware WWW Proxy Caching Algorithm Pei Cao Sandy Irani y 1 Introduction As the World Wide Web has grown in popularity in recent years, the percentage of network trac due to HTTP

More information

Resource management. Real-Time Systems. Resource management. Resource management

Resource management. Real-Time Systems. Resource management. Resource management Real-Time Systems Specification Implementation Verification Mutual exclusion is a general problem that exists at several levels in a real-time system. Shared resources internal to the the run-time system:

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4

Algorithms Implementing Distributed Shared Memory. Michael Stumm and Songnian Zhou. University of Toronto. Toronto, Canada M5S 1A4 Algorithms Implementing Distributed Shared Memory Michael Stumm and Songnian Zhou University of Toronto Toronto, Canada M5S 1A4 Email: stumm@csri.toronto.edu Abstract A critical issue in the design of

More information

Advanced Topic: Efficient Synchronization

Advanced Topic: Efficient Synchronization Advanced Topic: Efficient Synchronization Multi-Object Programs What happens when we try to synchronize across multiple objects in a large program? Each object with its own lock, condition variables Is

More information

1. Background. 2. Demand Paging

1. Background. 2. Demand Paging COSC4740-01 Operating Systems Design, Fall 2001, Byunggu Yu Chapter 10 Virtual Memory 1. Background PROBLEM: The entire process must be loaded into the memory to execute limits the size of a process (it

More information

TABLES AND HASHING. Chapter 13

TABLES AND HASHING. Chapter 13 Data Structures Dr Ahmed Rafat Abas Computer Science Dept, Faculty of Computer and Information, Zagazig University arabas@zu.edu.eg http://www.arsaliem.faculty.zu.edu.eg/ TABLES AND HASHING Chapter 13

More information

Implementations of Dijkstra's Algorithm. Based on Multi-Level Buckets. November Abstract

Implementations of Dijkstra's Algorithm. Based on Multi-Level Buckets. November Abstract Implementations of Dijkstra's Algorithm Based on Multi-Level Buckets Andrew V. Goldberg NEC Research Institute 4 Independence Way Princeton, NJ 08540 avg@research.nj.nec.com Craig Silverstein Computer

More information

Lecture 1 Introduction (Chapter 1 of Textbook)

Lecture 1 Introduction (Chapter 1 of Textbook) Bilkent University Department of Computer Engineering CS342 Operating Systems Lecture 1 Introduction (Chapter 1 of Textbook) Dr. İbrahim Körpeoğlu http://www.cs.bilkent.edu.tr/~korpe 1 References The slides

More information

Concurrency: Deadlock and Starvation

Concurrency: Deadlock and Starvation Concurrency: Deadlock and Starvation Chapter 6 E&CE 354: Processes 1 Deadlock Deadlock = situation in which every process from a set is permanently blocked, i.e. cannot proceed with execution Common cause:

More information