Predicting the response time of a new task on a Beowulf cluster

Marta Beltrán and Jose L. Bosque
ESCET, Universidad Rey Juan Carlos, 28933 Madrid, Spain
mbeltran@escet.urjc.es, jbosque@escet.urjc.es

Abstract. This paper addresses the problem of predicting the response time of an incoming task on a cluster node. These predictions have a significant effect in areas such as dynamic load balancing, scalability analysis and parallel systems modelling. A response time prediction requires an estimation of the CPU time that will be available to the task during its execution. All the tasks in the run queue share the processor time in a balanced way, but the CPU time consumed by each task depends on whether it is CPU-bound or not. This paper presents two new response time prediction models. The first is a mixed model based on two widely used models, the CPU availability and Round Robin models. The second, called the RTP model, is a completely new model based on a detailed study of different kinds of tasks and their CPU time consumption. The predictive power of these models is evaluated by running a large set of tests; the predictions obtained with the second proposed model exhibit an error of less than 2 % in all these experiments.

1 Introduction

Beowulf clusters are becoming very popular due to their good price-performance ratio, scalability and flexibility, so the study of these systems is a research area of increasing interest ([1], [11], [8]). In Beowulf clusters, the CPU is one of the most important resources ([2]). Because of the dynamic nature of these systems, the CPU load on each of the cluster nodes can vary drastically in a very short time. Predicting the amount of work on the different cluster nodes is a basic problem that arises in many contexts, such as cluster modelling, performance analysis and dynamic load balancing. If the response time of a new task is known on each of the cluster nodes, load estimation is straightforward.
But response times can only be measured for completed processes, and these times must very often be known before jobs begin their execution. Therefore, the applications mentioned above require a prediction, implicit or explicit, of the response time of a new task on the different cluster nodes. In this paper the CPU assignment (A) is proposed to measure and compare the load of different cluster nodes, in other words, the response time of a new task on each of these nodes. The assignment is defined as the percentage of CPU time that would be available to a newly created task. CPU availability has been successfully used before, for example, to schedule programs in distributed systems ([6], [12]).

The problem of predicting the available CPU in a cluster node is examined. The contributions of this paper are an analytical and experimental study of two well-known response time prediction models, two new static models for CPU assignment computation, a verification of these models through their application in a complete set of experiments, and a comparison of all the obtained results. In contrast to the other approaches, the response time predictions with the second proposed model exhibit an error of less than 2 %, so the experimental results indicate that this new model is accurate enough for all the mentioned contexts.

The rest of the paper is organized as follows. Section 2 discusses related work on predicting processor workload. Section 3 presents two existing CPU assignment models and proposes two improved new models. Experimental results comparing the four discussed models are reported in Section 4. Finally, Section 5 presents conclusions and suggestions for future work.

2 Background

Research that is closely related to this paper falls into two groups of models: those that predict the future from the past, and those based on the task queue model.
As an example of the first kind of model, [13] focused on making short- and medium-term predictions of CPU availability on time-shared Unix systems. [10] presented a method based on neural networks for automatically learning to predict CPU load. Finally, [4] and [5] evaluated linear models for load prediction and implemented a system that could predict the running time of a compute-bound task on a host.

Queueing models have been widely used for processors due to their simplicity, so the second kind of model is more widespread. In a highly influential paper, Kunz ([9]) showed the influence of workload descriptors on load balancing performance and concluded that the best workload descriptor is the number of tasks in the run queue. In [17] the CPU queue length is also used as an indication of processor load, and this load index appears again, for example, in [3], [15] and [16]. Finally, in [7], the number of tasks in the run queue is presented as the basis of good load indices, but an improvement is proposed by averaging this length over a period of one to four seconds.

3 Response time prediction models

In this paper the CPU assignment (A) is defined as the percentage of CPU time that would be available to a new incoming task in a cluster node. This parameter is used to analyse prediction models because the response time of a task is directly related to the average CPU assignment it receives during its execution. If a process is able to obtain 50 % of the CPU time slices, it is expected to take twice as long to execute as it would if the CPU were completely unloaded. So a response time prediction for a new task requires a prediction of the CPU assignment for this task during its execution.

There are two popular response time prediction models widely used, for example, in dynamic load balancing applications. These models decide how to map new tasks to cluster nodes by determining the least loaded node, thus predicting on which node the response time of the new task will be shortest. They define a load index, such as the percentage of available CPU or the number of tasks in the run queue, and base their predictions on this index value.
But they do not take into account the influence of new tasks on system performance. The assignment concept tries to consider the effects of executing new tasks on CPU availability.

3.1 Previous models analysis

The simplest approach is to consider the least loaded node to be the node with the most free or idle CPU. Analysing this model from the CPU assignment point of view, it considers the CPU assignment to be the available CPU at a given instant:

    A = Available CPU    (1)

Thus, the predicted assignment for a new task is the percentage of CPU idle time. This model, called the CPU availability model in the rest of this paper, has one important drawback: it does not take into account processor time-sharing between tasks. For example, if a cluster node is executing one CPU-bound task (consuming almost all of the CPU time), this model predicts around a 5 % CPU assignment for a new task, but in a time-shared system the new task would obtain around a 50 % CPU assignment, because the two tasks would share the CPU time.

In most computer systems, tasks share the processor time using a Round Robin scheduling policy ([14]). In this policy a time slice or quantum (q) is defined. The CPU scheduler picks a process from the run queue and dispatches it to the processor. If the process is still running at the end of its quantum, it is preempted and added to the queue tail. But if the process finishes or sleeps before the end of the quantum, it leaves the processor voluntarily. The other well-known response time prediction model is based on this scheduling and takes the node with the fewest tasks in the run queue as the least loaded cluster node. So the assignment is predicted as the percentage of CPU time that corresponds to a new task under this scheduling policy. If the number of tasks in the run queue is N, the assignment prediction with the Round Robin model is:

    A = 1 / (N + 1)    (2)

because the processor time will be shared in a balanced way between the N+1 tasks.

This model is widely used, but it only considers CPU-bound tasks. These tasks are computing intensive CPU operations all the time and do not perform memory swapping or I/O operations (with disks or network). Indeed, a node executing one CPU-bound task could give a new task a smaller assignment than a node executing several non-CPU-bound tasks, but this model always predicts a larger assignment for the new task in the first case.
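The two baseline predictions above can be sketched as follows (a minimal illustration; the function names and the example load values are ours, not from the paper):

```python
def assignment_cpu_availability(idle_fraction):
    """CPU availability model (eq. 1): a new task is predicted to get
    exactly the currently idle fraction of the CPU."""
    return idle_fraction

def assignment_round_robin(n):
    """Round Robin model (eq. 2): the new task and the N tasks already
    in the run queue share the CPU equally."""
    return 1.0 / (n + 1)

# One CPU-bound task is running, so the CPU is ~95% busy:
print(assignment_cpu_availability(0.05))  # 0.05 for the new task
print(assignment_round_robin(1))          # 0.5 (time-sharing)
```

This reproduces the drawback discussed above: with one CPU-bound task in the run queue, the availability model predicts an almost-zero assignment, while time-sharing would actually give the new task about half of the CPU.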

3.2 Proposed models

To overcome these limitations and take all kinds of tasks into account without monitoring other resources such as memory or network, a mixed model is proposed, combining the two previous prediction models. Let U denote the CPU utilization (the percentage of CPU time used for the execution of all the tasks in the run queue). The CPU assignment prediction for a new task with this model is:

    A = 1 / (N + 1)   if (1 - U) <= 1 / (N + 1)
        1 - U         otherwise                     (3)

Therefore, if there are only CPU-bound tasks executing on a processor, the assignment is obtained by applying the Round Robin model. But when there are non-CPU-bound tasks, they are not taking advantage of all their corresponding CPU time, and the CPU assignment for an incoming task will be all the available CPU, which is then greater than 1/(N+1). This model takes the best of the two models presented in the previous subsection, so it should perform well when the run queue contains only CPU-bound tasks (Round Robin model) and, at the other extreme, only non-CPU-bound tasks (CPU availability model). But it is unclear how this model performs when there are different types of tasks in the run queue.

Finally, an improvement on this model is proposed, with a more sophisticated account of how CPU time is shared between different tasks. This model is called the RTP model (Response Time Prediction model). Under Round Robin scheduling, CPU-bound tasks always run until the end of their time slices, while non-CPU-bound tasks sometimes leave the processor without finishing their quanta. The remaining time of these slices is consumed by the CPU-bound tasks, which are always ready to execute CPU intensive operations. The aim is to take this situation into account, so let t_CPU denote the CPU time consumed by a non-CPU-bound task when the CPU is completely unloaded, and let t denote the response time of this task in the same environment. The fraction of time spent in CPU intensive operations by this task is:

    X = t_CPU / t

Suppose that there are n CPU-bound tasks (denoted by CPU-b) and m non-CPU-bound tasks (denoted by ¬CPU-b) in the run queue. Therefore N = n + m, and the proposed model predicts the following assignment for the i-th non-CPU-bound task when there is a new incoming task:

    A(¬CPU-b)_i = X_i / ((n + 1) + Σ_{k=0}^{m} X_k)    (4)

The new task is assumed to be CPU-bound because this is the worst case, in which the new task consumes all its CPU slices. So, with the new incoming task there will be n+1 CPU-bound tasks in the run queue. Using the predicted assignments for all non-CPU-bound tasks, the assignment for a new task can be computed as all the CPU time that is not consumed by non-CPU-bound tasks, shared with the Round Robin policy between the CPU-bound tasks:

    A = (1 - Σ_{i=0}^{m} A(¬CPU-b)_i) / (n + 1)    (5)

4 Experimental results

To determine the accuracy of these four models, a set of experiments has been developed to compare measured and predicted response times. The criterion used to evaluate the assignment models is the relative error of their predictions. All the experiments take place on a 550 MHz Pentium III PC with 128 MB of RAM. The operating system installed on this PC is Debian Linux with kernel version 2.2.19, which uses a Round Robin scheduling policy with a 100 millisecond time slice (q = 100 ms).
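As a concrete illustration of how these predictions are computed, the mixed and RTP models (eqs. 3 to 5) and the predicted response time t_P = t / A_P can be sketched as follows. This is our reading of the equations (the quantum q cancels out of eq. 4); the function names are ours and the snippet is illustrative, not the authors' code:

```python
def assignment_mixed(n, utilization):
    """Mixed model (eq. 3): Round Robin share when the idle CPU is below
    the fair share, otherwise all of the idle CPU."""
    fair_share = 1.0 / (n + 1)
    idle = 1.0 - utilization
    return fair_share if idle <= fair_share else idle

def assignment_rtp(n_cpu_bound, x_fractions):
    """RTP model (eqs. 4 and 5). x_fractions holds the CPU fraction X_k of
    each non-CPU-bound task in the run queue; the incoming task is assumed
    CPU-bound (worst case), so there are n+1 CPU-bound tasks in total."""
    denom = (n_cpu_bound + 1) + sum(x_fractions)
    consumed = sum(x / denom for x in x_fractions)   # sum of eq. 4 over i
    return (1.0 - consumed) / (n_cpu_bound + 1)      # eq. 5

def predicted_response_time(t_unloaded, a_predicted):
    """t_P = t / A_P: a task that receives a fraction A_P of the CPU runs
    1/A_P times slower than on a completely unloaded CPU."""
    return t_unloaded / a_predicted

# Run queue with two non-CPU-bound tasks, X = 0.2 and X = 0.1 (the
# test0+test5+test6 experiment of Section 4, as seen by the new task test0):
a = assignment_rtp(0, [0.2, 0.1])
print(round(a, 3))                                # 0.769
print(round(predicted_response_time(12.0, a), 2))
```

With t = 12 s the RTP model predicts about 15.6 s here, close to the 15.86 s measured for test0 in that experiment, while the mixed model (N = 2, U = 0.3) predicts 12 / 0.7, about 17.1 s.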

Table 1. Test loads

    Load  | X    | t_CPU (s) | t (s)
    test0 | 1.00 | 12.00     | 12
    test1 | 0.66 |  7.94     | 12
    test2 | 0.50 |  6.11     | 12
    test3 | 0.40 |  4.81     | 12
    test4 | 0.30 |  3.65     | 12
    test5 | 0.20 |  2.45     | 12
    test6 | 0.10 |  1.20     | 12
    test7 | 0.05 |  0.60     | 12

4.1 Test loads

Synthetic workloads have been generated, due to the lack of appropriate trace data and to their simplicity. The CPU-bound test load (test0) is a very simple program:

    loop
        consume CPU
    end loop

The CPU intensive operation used to consume processor time is a vector product. The non-CPU-bound test loads (testi, with i = 1, 2, ..., 7) are of the form:

    loop
        consume u milliseconds of CPU
        sleep s milliseconds
    end loop

Different loads have been generated (Table 1) by controlling the percentage of consumed CPU with the u and s parameters (because X = u / (u + s)). Besides, to avoid any influence of the memory hierarchy on the experiments, all test loads use data stored in the L1 cache memory.

4.2 Experiments

The first set of experiments to evaluate the models' validity and accuracy is performed statically: different sets of test loads are executed simultaneously in the system, beginning and ending at the same time. As can be seen in the first column of Tables 2 and 3, these sets of loads combine different kinds of tasks, CPU-bound and non-CPU-bound, with different CPU utilization percentages. In each experiment, CPU and response times are measured for all test loads. In order to determine the most accurate model, assignment predictions are made for the task called test0 in each experiment. This task is selected because it is CPU-bound, as a new incoming task is supposed to be (the worst case). If A_P is the predicted assignment for this task, the predicted response time is:

    t_P = t / A_P

where t is the response time of the task called test0 when it is executed on the unloaded CPU.
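The non-CPU-bound testi loads of Section 4.1 can be sketched as follows (our illustration, not the authors' code; the u and s parameter names follow the pseudocode above, and on an unloaded CPU the load's fraction is X = u / (u + s)):

```python
import time

def cpu_fraction(u_ms, s_ms):
    """Nominal CPU fraction X of a load that computes for u ms and then
    sleeps for s ms in each iteration."""
    return u_ms / (u_ms + s_ms)

def synthetic_load(u_ms, s_ms, duration_s):
    """Alternate busy CPU work with voluntary sleeps, like the testi loads."""
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        t0 = time.monotonic()
        while (time.monotonic() - t0) * 1000.0 < u_ms:
            pass                       # busy-wait: consume CPU time
        time.sleep(s_ms / 1000.0)      # leave the processor voluntarily

print(cpu_fraction(50.0, 50.0))   # 0.5, analogous to the test2 load
synthetic_load(5.0, 5.0, 0.05)    # brief demo run (~50 ms)
```

A load with u = s yields X = 0.5; decreasing u relative to s moves the load toward the test7 end of Table 1.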
Therefore, a model's accuracy can be determined with the relative error of this prediction:

    e = abs( 100 (t_m - t_P) / t_m )    (6)

where t_m is the response time measured when the load test0 is executed simultaneously with the other tasks in a given experiment.

The results obtained with all these experiments are detailed in Tables 2 and 3 and in Figures 1, 2 and 3. The tables present the response time measured for test0 (t_m) and the CPU times measured for the tasks in each experiment (denoted by t_CPU1, t_CPU2, t_CPU3 and t_CPU4), together with the predicted response times for test0 (t_P) and the percentages of relative prediction error (e). All time values are given in seconds. There are four predicted times and prediction errors because the four presented models are evaluated: the CPU availability model (model A), the Round Robin model (model B), the mixed model (model C) and the RTP model (model D). The assignment predictions have been computed using equations 1, 2, 3 and 5 respectively.

Table 2. Results with the four discussed models. Sets of experiments reported in figures

    Exp.                       | t_m   | t_CPU1 | t_CPU2 | t_CPU3 | t_CPU4 | t_PA   | e_A(%)  | t_PB  | e_B(%) | t_PC  | e_C(%) | t_PD  | e_D(%)
    test0, test0               | 24.02 | 12.01  | 12.01  | -      | -      | 240.20 | 899.58  | 24.02 | 0.00   | 24.02 | 0.00   | 24.02 | 0.04
    test0, test1               | 20.01 | 12.03  | 7.72   | -      | -      | 35.38  | 76.82   | 24.06 | 20.24  | 24.06 | 20.24  | 19.94 | 0.12
    test0, test2               | 18.19 | 12.08  | 6.01   | -      | -      | 24.16  | 32.85   | 24.16 | 32.82  | 24.16 | 32.75  | 18.14 | 0.36
    test0, test4               | 15.80 | 12.01  | 3.58   | -      | -      | 17.16  | 8.59    | 24.02 | 52.02  | 17.16 | 8.59   | 15.63 | 1.10
    test0, test6               | 13.38 | 12.07  | 1.20   | -      | -      | 13.41  | 0.23    | 24.14 | 80.42  | 13.41 | 0.23   | 13.29 | 0.76
    test0, test0, test0        | 36.03 | 12.01  | 12.01  | 12.01  | -      | 480.40 | 1211.14 | 36.03 | 0.00   | 36.03 | 0.00   | 36.03 | 0.00
    test0, test0, test1        | 32.14 | 12.01  | 12.01  | 7.70   | -      | 70.65  | 119.81  | 36.03 | 12.10  | 36.03 | 12.10  | 32.00 | 0.53
    test0, test0, test3        | 29.16 | 12.02  | 12.02  | 4.71   | -      | 40.07  | 37.40   | 36.06 | 23.66  | 36.06 | 23.66  | 28.85 | 1.00
    test0, test0, test5        | 26.80 | 12.08  | 12.08  | 2.40   | -      | 30.20  | 12.69   | 36.24 | 35.22  | 30.20 | 12.69  | 26.60 | 0.72
    test0, test0, test7        | 24.96 | 12.05  | 12.05  | 0.60   | -      | 25.37  | 1.64    | 36.15 | 44.83  | 25.37 | 1.64   | 24.78 | 0.78
    test0, test0, test0, test0 | 48.09 | 12.01  | 12.01  | 12.01  | 12.01  | 720.60 | 1398.44 | 48.04 | 0.10   | 48.04 | 0.10   | 48.04 | 0.10
    test0, test0, test0, test1 | 44.45 | 12.04  | 12.04  | 12.04  | 7.72   | 106.24 | 139.00  | 48.16 | 8.35   | 48.16 | 8.34   | 44.29 | 0.21
    test0, test0, test0, test2 | 42.65 | 12.01  | 12.01  | 12.01  | 6.00   | 72.06  | 68.96   | 48.04 | 12.64  | 48.04 | 12.64  | 42.28 | 0.68
    test0, test0, test0, test4 | 40.19 | 12.02  | 12.02  | 12.02  | 3.58   | 51.51  | 28.18   | 48.06 | 19.63  | 48.08 | 19.63  | 39.63 | 0.21
    test0, test0, test0, test6 | 37.80 | 12.06  | 12.06  | 12.06  | 1.20   | 40.20  | 6.35    | 48.24 | 27.62  | 40.20 | 6.35   | 37.45 | 0.83

Table 3. Results with the four discussed models. Remaining experiments

    Exp.                       | t_m   | t_CPU1 | t_CPU2 | t_CPU3 | t_CPU4 | t_PA   | e_A(%)  | t_PB  | e_B(%) | t_PC  | e_C(%) | t_PD  | e_D(%)
    test0, test5, test6        | 15.86 | 12.02  | 2.43   | 1.20   | -      | 17.17  | 8.27    | 36.06 | 127.36 | 17.20 | 9.69   | 15.65 | 0.18
    test0, test2, test7        | 18.67 | 12.01  | 5.94   | 0.52   | -      | 26.69  | 42.95   | 36.03 | 92.98  | 26.84 | 43.71  | 18.72 | 0.24
    test0, test1, test4        | 23.53 | 12.07  | 7.73   | 3.58   | -      | 301.75 | 1182.41 | 36.21 | 53.89  | 36.21 | 54.02  | 23.66 | 0.54
    test0, test3, test6        | 18.03 | 12.02  | 4.78   | 1.02   | -      | 24.04  | 33.33   | 36.06 | 100.00 | 24.04 | 33.33  | 18.05 | 0.14
    test0, test0, test6, test7 | 26.04 | 12.03  | 12.03  | 1.06   | 0.50   | 14.16  | 45.62   | 48.12 | 84.79  | 28.31 | 8.70   | 25.89 | 0.63
    test0, test0, test5, test6 | 27.88 | 12.02  | 12.02  | 2.40   | 1.02   | 17.24  | 37.98   | 48.12 | 72.45  | 34.34 | 23.18  | 27.76 | 0.14
    test0, test0, test4, test7 | 28.44 | 12.01  | 12.01  | 3.59   | 0.51   | 18.55  | 34.72   | 48.04 | 68.92  | 36.95 | 29.94  | 28.34 | 0.28

Results from Table 2 are reported in Figures 1, 2 and 3. Figure 1 corresponds to a set of experiments with one CPU-bound task (test0) and one variable non-CPU-bound task. This task increases its X value from test7 (X = 5 %) to test1 (X = 66 %); some of these results are omitted from the tables for space reasons. The prediction error for the CPU-bound task's response time is plotted against the fraction of CPU time of the non-CPU-bound task (defined as X in the previous section). This curve is plotted for the four discussed models. Figures 2 and 3 present the results for the same kind of experiments but with two and three CPU-bound tasks respectively.
The remaining experiments, with other combinations of tasks, are shown in Table 3. From both the tables and the figures, it is clear that large reductions in prediction error are obtained in all the experiments using the RTP model. Indeed, the prediction error with this model is always less than 2 %, and there is no instance in which one of the other models performs better than the RTP model.

Figures 1, 2 and 3 show how the prediction error of the CPU availability model varies with the X value of the non-CPU-bound task. For low X values the prediction error is low too, but as this value increases, the prediction error rises sharply. As noted before, this can be attributed to the model ignoring the possibility of time-sharing between tasks. A task with X around 50 % (nearly CPU-bound) would share the processor time in a balanced way with the tasks in the run queue, which contradicts the prediction made with this model: a very low CPU assignment for this task.

Fig. 1. Prediction error for the discussed models with one CPU-bound task and one varying task
Fig. 2. Prediction error for the discussed models with two CPU-bound tasks and one varying task
Fig. 3. Prediction error for the discussed models with three CPU-bound tasks and one varying task

In contrast to this approach, the Round Robin model performs very well for large X values (nearly CPU-bound tasks), but the prediction error increases dramatically when X decreases. This was expected, because this model does not take into consideration the remaining time of the CPU time slices left by non-CPU-bound tasks. So, the assignment prediction when there are tasks of this kind in the run queue is always less than its real value.

Finally, for the mixed model, the value of e is low for both low and large values of X. The previous results give some insight into why the error varies in this way. This model was proposed to take the best of the CPU availability and Round Robin models; thus, the mixed model curve converges with the CPU availability model curve at low values of X and with the Round Robin model curve at large values. Notice that the error increases for medium values of X, and this is the disadvantage of the model, although it represents a considerable improvement over the two previous models because the prediction error does not increase indefinitely. Still, even this last model is not superior to the RTP model: besides the low prediction error values obtained with the RTP model in all the experiments, the figures and tables show that its error is almost independent of the kinds of tasks considered.

5 Conclusions

The selection of a response time prediction model is nontrivial when minimal prediction errors are required. The main contribution of this paper is a detailed analysis of two existing and two new prediction models for all kinds of tasks in a real computer system. The CPU assignment concept has been introduced to predict response times. With previous prediction models, the greatest source of error comes from considering only the current system load. But due to Round Robin scheduling policies, the execution of a new incoming task has a significant influence on the response times of all the tasks in the system run queue. Thus, CPU assignment is introduced to consider both the current system load and the effects of executing a new task on CPU availability.
A wide variety of experiments has been performed to analyse, in terms of CPU assignment prediction, the accuracy of the CPU availability and Round Robin models. The results presented in the previous section reveal that these models perform very well in certain contexts but fail in others. The CPU availability model obtains errors between 0 and 10 % in experiments with CPU-bound tasks and one non-CPU-bound task with low X, but the errors increase dramatically, going beyond 1000 %, when X increases. The Round Robin model results are completely different: the prediction error is near 0 % when all the tasks in the run queue are CPU-bound, but it increases when one or more tasks are non-CPU-bound, and in these cases the error can exceed 100 %. These results suggest that it is reasonable to combine these two models to improve their predictions. So the first proposed model (the mixed model) is a simple combination of the two discussed models, and the experimental results indicate that an important improvement over them is obtained: with high and low X values the error is as low as with the CPU availability and Round Robin models, and in the remaining experiments the prediction error does not exceed 54 %.

Finally, an optimized and relatively simple model is proposed. The RTP model is based on a study of CPU time-sharing and the scheduling policies used by the operating system. This model takes into consideration the influence of a new task's execution on the set of tasks in the run queue. Experimental results demonstrate the validity and accuracy of this model: the prediction error is always less than 2 %. Thus the RTP model has been shown to be effective, simple and very accurate under static conditions. In the context of Beowulf clusters these results are encouraging. A very interesting line for future research is to extend the RTP model to dynamic environments.
This may require some changes in the model to avoid using a priori information about tasks, such as each task's percentage of CPU utilization (X). But it would provide a dynamic model, very useful for predicting the response time of new incoming tasks on cluster nodes.

References

1. Gordon Bell and Jim Gray. What's next in high-performance computing? Communications of the ACM, 45(2):91-95, 2002.
2. Rajkumar Buyya. High Performance Cluster Computing, Volume 1: Architecture and Systems. Prentice-Hall PTR, 1999.
3. K. Benmohammed-Mahieddine, P.M. Dew and M. Kara. A periodic symmetrically-initiated load balancing algorithm for distributed systems. In Proceedings of the 14th International Conference on Distributed Computing Systems, 1994.
4. Peter A. Dinda. Online prediction of the running time of tasks. In Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, pages 336-337, 2001.

5. Peter A. Dinda. A prediction-based real-time scheduling advisor. In 16th International Parallel and Distributed Processing Symposium. IEEE, 2002.
6. Francine D. Berman et al. Application-level scheduling on distributed heterogeneous networks. In Proceedings of Supercomputing 1996, 1996.
7. D. Ferrari and S. Zhou. An empirical investigation of load indices for load balancing applications. In 12th IFIP International Symposium on Computer Performance Modelling, Measurement and Evaluation. Elsevier Science Publishers, 1987.
8. John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 2002.
9. Thomas Kunz. The influence of different workload descriptions on a heuristic load balancing scheme. IEEE Transactions on Software Engineering, 17(7):725-730, 1991.
10. Pankaj Mehra and Benjamin W. Wah. Automated learning of workload measures for load balancing on a distributed system. In Proceedings of the 1993 International Conference on Parallel Processing, Volume 3: Algorithms and Applications, pages 263-270, 1993.
11. Gregory F. Pfister. In Search of Clusters: The Ongoing Battle in Lowly Parallel Computing, 2nd ed. Prentice Hall, 1998.
12. Neil T. Spring and Richard Wolski. Application level scheduling of gene sequence comparison on metacomputers. In International Conference on Supercomputing, pages 141-148, 1998.
13. R. Wolski, N. Spring and J. Hayes. Predicting the CPU availability of time-shared Unix systems on the computational grid. In Proceedings of the Eighth International Symposium on High Performance Distributed Computing, pages 105-112. IEEE, 1999.
14. A. S. Tanenbaum. Distributed Operating Systems. Prentice-Hall, Inc., 1995.
15. Gil-Haeng Lee, Wang-Don Woo and Byeong-Nam Yoon. An adaptive load balancing algorithm using simple prediction mechanism. In Proceedings of the Ninth International Workshop on Database and Expert Systems Applications, pages 496-501, 1998.
16. Kai Shen, Tao Yang and Lingkun Chu. Cluster load balancing for fine-grain network services. In International Parallel and Distributed Processing Symposium, pages 51-58, 2002.
17. S. Zhou. A trace-driven simulation study of dynamic load balancing. IEEE Transactions on Software Engineering, pages 1327-1341, 1988.