Scheduling Techniques of Processor Scheduling in Cellular Automaton


International Conference on Intelligent Computational Systems (ICICS'22) Jan. 7-8, 22 Dubai

Mohammad S. Laghari and Gulzar A. Khuwaja

Abstract: Many problems in the computer simulation of systems in science and engineering present potential for parallel implementation through one of three major paradigms: algorithmic parallelism, geometric parallelism, and processor farming. Static process scheduling techniques have been used successfully to exploit geometric and algorithmic parallelism, while dynamic process scheduling is better suited to the independent processes inherent in the processor farming paradigm. This paper considers the application of parallel or multi-computers to a class of problems exhibiting the spatial data dependency characteristic of the geometric paradigm. However, by using the processor farming paradigm in conjunction with geometric decomposition, a dynamic scheduling technique is developed to suit the MIMD structure of the multi-computers. The specific problem chosen for the investigation of scheduling techniques is the computer simulation of Cellular Automaton models.

Keywords: Cellular Automaton, multi-computers, parallel paradigms, scheduling.

I. INTRODUCTION

STATIC and dynamic scheduling of processes are techniques that can be used to optimize performance in parallel computing systems. Such systems need an acceptable balance between communication and computation times to ensure efficient use of processing resources. When the time to compute on a subproblem is less than the time taken to receive its data or transmit its results, the communication bandwidth becomes the limit to performance. With dynamic scheduling, an appropriate program can redirect the flow of data at run time to keep the processors as busy as possible and help achieve optimum performance [1].
The problem chosen here for the investigation of scheduling techniques is the cellular automaton (C.A.). The C.A. approach has been used in many applications, such as image processing, self-learning machines, fluid dynamics, and modeling parallel computers. Because of their small compute requirements, many C.A. algorithms implemented on a network of processors exhibit the imbalance discussed above.

A cellular automaton simulation with an artificially increased compute load per cell (in the form of a number of simulated multiplies) is considered for parallelization. Such a simulation is representative of a class of recursive algorithms with local spatial dependency and fine granularity that may be encountered in biological applications, finite elements, certain problems in image analysis, and computational geometry [2]-[5]. These types of applications exhibit geometric parallelism and may be considered best suited to static scheduling. However, using dynamic scheduling, the MIMD structure of multicomputer networks is exploited, and a comparison of the two schemes is given in the form of total timings and speedup.

Mohammad S. Laghari is with the Electrical Engineering Department, Faculty of Engineering, United Arab Emirates University, P.O. Box: 7555, Al Ain, U.A.E. (e-mail: mslaghari@uaeu.ac.ae). Gulzar A. Khuwaja is with the Department of Computer Engineering, College of Computer Sciences & Information Technology, King Faisal University, Al Ahsa 3982, Kingdom of Saudi Arabia (e-mail: Khuwaja@kfu.edu.sa).

II. THE C.A. MODEL

Cellular automata were introduced in the late forties by John von Neumann, following a suggestion of Stan Ulam, to provide a more realistic model for the behavior of complex, extended systems [6]. In its simplest form, a cellular automaton consists of a lattice or line of sites known as cells, each with value 0 or 1. These values are updated in a sequence of discrete time steps according to a definite, fixed rule.
The overall properties of a cellular automaton are usually not readily evident from its basic rule. Given the rule, however, its behavior can always be determined by explicit simulation on a digital computer. Cellular automata are mathematical idealizations of physical systems in which space and time are discrete, and physical quantities take on a finite set of discrete values.

The C.A. model used in this investigation is a 1-dimensional cellular automaton in which processing takes place in a nearly homogeneous system with a fine level of granularity. It is conceptually simple and has a high degree of parallelism. It consists of a line of cells or sites $x_i$ (where $i = 1, \ldots, n$) with periodic boundary conditions $x_{n+1} = x_1$, which means that the last cell in the line of sites is connected to the first cell. Each cell stores a single value or variable known as its state. At regular intervals in time the values of all cells are simultaneously (synchronously) updated according to a local transition rule whose result depends on the previous state of the cell and those of its neighbors. The neighborhood of a given site is simply the site itself and the sites immediately adjacent to it on the left and right. Each cell may exist in one of two states, $x_i = 0$ or $1$.

The local rules of a C.A. can be described by an eight-digit binary number, as shown in the following example. Fig. 1 specifies one particular set of rules for an elementary C.A.
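The model described above (a ring of two-state cells, a three-site neighborhood, and a rule encoded as an eight-digit binary number) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `rule_table`, `step` and `evolve` are hypothetical, and the bit ordering (neighborhood 111 as the most significant bit) is assumed to follow the usual convention for elementary C.A. rule numbers.

```python
def rule_table(rule_number):
    """Map each neighborhood (left, centre, right) to the next centre state."""
    # Bit k of the rule number is the output for the neighborhood whose three
    # bits, read as a binary number, equal k (so 111 selects the top bit).
    return {((k >> 2) & 1, (k >> 1) & 1, k & 1): (rule_number >> k) & 1
            for k in range(8)}


def step(cells, table):
    """Synchronously update a ring of cells (periodic boundary conditions)."""
    n = len(cells)
    # cells[i - 1] wraps to the last cell when i == 0; (i + 1) % n wraps
    # the right neighbor of the last cell back to the first cell.
    return [table[(cells[i - 1], cells[i], cells[(i + 1) % n])]
            for i in range(n)]


def evolve(cells, rule_number, steps):
    """Return the list of configurations for time steps 0..steps."""
    table = rule_table(rule_number)
    history = [list(cells)]
    for _ in range(steps):
        history.append(step(history[-1], table))
    return history


if __name__ == "__main__":
    # Illustrative run: a single site with value 1 on a ring of 31 cells,
    # evolved with the sum-modulo-two rule (rule number 150).
    width = 31
    start = [0] * width
    start[width // 2] = 1
    for row in evolve(start, 150, 8):
        print("".join("*" if c else " " for c in row))
```

Because `step` builds a new list instead of updating in place, every cell sees only the previous time step, which is what the synchronous update requires.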

Fig. 1 The 8 possible states of 3 adjacent sites

The top row gives all the $2^3 = 8$ possible values of the three sites in the neighborhood, and below each one is the value taken by the middle site on the next time step according to a particular local rule. Since any eight-digit binary number specifies a cellular automaton, there are $2^8 = 256$ possible distinct C.A. rules in one dimension with a 3-site neighborhood. The rule in the lower line of Fig. 1 is rule number 150 (10010110), which is used for the implementation of the C.A. algorithms in this paper.

The rules may be considered as a Boolean function of the sites within the neighborhood. Let $x_i(t)$ be the value of site $i$ at time step $t$. For the above example, the value of a particular site is simply the sum modulo two of the values of its own and its two neighboring sites on the previous time step. The Boolean equivalent of this rule is given by:

$$x_i(t+1) = \bigl(x_{i-1}(t) + x_i(t) + x_{i+1}(t)\bigr) \ \mathrm{REM}\ 2$$

where REM is the remainder function. This can be written in the form:

$$x_i(t+1) = x_{i-1}(t) \oplus x_i(t) \oplus x_{i+1}(t)$$

where $\oplus$ denotes addition modulo two (exclusive disjunction), the left-hand side is the value of a particular site at the next time step, and the three terms on the right are the values of its left neighbor, the site itself, and its right neighbor on the previous time step.

The following shows how the above equations relate to rule number 150. Suppose $A = x_{i-1}(t)$, $B = x_i(t)$ and $C = x_{i+1}(t)$; then, using Boolean laws, the equation becomes:

$$A \oplus B \oplus C = (A\bar{B} + \bar{A}B) \oplus C = A\bar{B}\bar{C} + \bar{A}B\bar{C} + \bar{A}\bar{B}C + ABC$$

Putting this equation in truth Table I gives an output column that, read from the most significant bit, is rule number 150 in binary form: $10010110_2 = 150_{10}$.

TABLE I FINDING RULE NUMBER IN BINARY FORM

A  B  C  output
1  1  1  1
1  1  0  0
1  0  1  0
1  0  0  1
0  1  1  0
0  1  0  1
0  0  1  1
0  0  0  0

Fig. 2 shows the evolution of a particular state of the C.A. through two time steps in the above example.

Fig. 2 Evolution of 1-D C.A. through two time steps

Fig. 3 shows the evolution of a 1-dimensional elementary cellular automaton according to the rule described above, starting from a state containing a single site with value 1. Sites with values 1 and 0 are represented by *s and blanks, respectively. The configuration of the cellular automaton at successive time steps is shown on successive lines. The time evolution is shown for at most 2 time steps or up to the point where the system is detected to cycle.

Fig. 3 Evolution of C.A. into a configuration up to 2 time steps

III. PARALLEL PARADIGMS

In order to use the computational potential of a large number of processors efficiently in a parallel processing environment, it is necessary to identify the important parallel features of the application. There are several simple paradigms for exploiting parallelism in scientific and engineering applications, but the most commonly occurring types fall into three classes. These three paradigms are described in more detail in [7], [8].

A. Algorithmic Parallelism

Algorithmic parallelism is present where the algorithm can be broken down into a pipeline of processors. In this decomposition the data flows through the processing elements.

B. Geometric Parallelism

Geometric parallelism is present where the problem can be broken down into a number of similar processes in such a way as to preserve processor data locality, with each processor operating on a different subset of the total data to be processed.

C. Processor Farm

A processor farm is present where each processor executes the same program with different initial data, in isolation from all the other processors in the farm [9], [10].

IV. ALGORITHMS

To meet the speed and performance requirements, a scalable and reconfigurable multi-computer system (NPLA) is used. This networked multi-computer system is somewhat similar to the NePA system used to implement Network-on-Chip [11]. The system is a linear array of processors. It includes RISC processors and memory blocks. Each processor in the array has a compactor, internal instruction memory, internal data memory, data control unit, and registers. One of the processors is used as a master or main processor and the remaining ones as slaves. The system has a network interface in which the main processor has a four-port router and the others have two-port routers. Routers can transfer both control and application data among processors. The two scheduling algorithms are described below.

A. Static Algorithm

In this implementation of the cellular automaton, the problem is statically implemented by using array processing. The algorithm is decomposed by using geometric parallelism. Ideally, the master processor would distribute a fixed number of cells uniformly across the ring of slave processors. At the start of each iteration, each cell process would broadcast the current state of its cell to its neighbors in parallel with inputting the states of its neighbors from the neighboring cell processes. After this exchange of data, the cell would update its state using the rule described earlier.

Instead of running individual cell processes in each slave, which would communicate with the neighboring cells after every update, the master processor distributes fixed-size array segments of cells (for a total length of at most 768 cells) uniformly across the worker array, with each processor responsible for its defined spatial area. Each iteration starts with the slave processors exchanging boundary information with their neighboring processors, so that the end elements of each array segment carry the information of the end elements of the neighboring segments. After this exchange, each array segment is updated, with the help of the neighboring elements, for all its elements in parallel by using the cellular automaton rule described earlier. The updated results are assigned to another array. Results of all iterations are communicated back to the master processor.

Simulation tests are carried out for 2 iterations or time steps using from 1 to 7 slave processors, supplied with fixed-size array segments for the total array length of 768 cells. Artificially increased compute loads in the form of multiplies per cell (in steps of 2 multiplies) are introduced. Five loads of 2, 4, 6, 8 and multiplies, respectively, are used, which reside in the worker process of each slave. Table II shows the total timings in seconds for a normal and a range of artificially increased compute loads.

TABLE II TOTAL TIMING IN SECONDS FOR 2 ITERATIONS IN STATIC SCHEME

The result without the additional compute load shows no improvement in performance when the algorithm is implemented on multiple processors, because the communications take more time than the computation in each slave. Results with the compute load of 2 additional multiplies show a reasonable improvement in timings.
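The static scheme described in this subsection (fixed-size segments, a boundary exchange at the start of every iteration, then a synchronous update into a second array) can be sketched as follows. This is a minimal single-process Python illustration under stated assumptions, not the authors' code: the names `split` and `static_iteration` are hypothetical, the workers are modeled as list entries rather than separate processors, and the array length is assumed to divide evenly among the workers.

```python
def xor_rule(left, centre, right):
    # Sum modulo two of the three sites (the rule described in Section II).
    return (left + centre + right) % 2


def split(cells, n_workers):
    """Master: cut the line of cells into equal fixed-size segments."""
    size = len(cells) // n_workers
    assert size * n_workers == len(cells), "length must divide evenly"
    return [cells[k * size:(k + 1) * size] for k in range(n_workers)]


def static_iteration(segments):
    """One iteration: exchange boundary cells, then update every segment."""
    p = len(segments)
    # Boundary exchange: each worker obtains the nearest end element of its
    # left and right neighbor; the workers form a ring, hence the wrap-around.
    halos = [(segments[k - 1][-1], segments[(k + 1) % p][0]) for k in range(p)]
    updated = []
    for seg, (left_end, right_end) in zip(segments, halos):
        padded = [left_end] + seg + [right_end]
        # Results are written to a new list so that every cell is updated
        # from the previous time step only.
        updated.append([xor_rule(padded[i - 1], padded[i], padded[i + 1])
                        for i in range(1, len(padded) - 1)])
    return updated


if __name__ == "__main__":
    line = [0] * 768
    line[384] = 1
    segments = split(line, 4)
    for _ in range(3):
        segments = static_iteration(segments)
    print(sum(sum(seg) for seg in segments), "cells set after 3 iterations")
```

All boundary values are collected before any segment is updated, which mirrors the exchange-then-compute ordering of the scheme; each worker then needs only two extra cells per iteration regardless of segment length.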
The comparison shows that, as the compute load increases, the overall performance of the algorithm and the utilization of the processors improve proportionally.

B. Dynamic Algorithm

In the previous implementation the allocation of processes to processors is defined at compile time. It is also possible to have the program perform the process allocation as it runs. In this implementation of the cellular automaton, the distribution of processing loads is performed dynamically. The topology is the same as in the previous examples, namely a master processor and up to 7 slaves, now operating as a farm of processors with the code replicated on each of them.

In this algorithm, the master processor distributes work packets to the farm of slave processors. This processor is also responsible for the geometric decomposition and for tracking the work packets through the iteration sequence. It consists of two main processes, send and receive, which execute in parallel and share two large arrays, data send and data receive. At the start of the first iteration, the send process farms out fixed-size data packets from the send array (which contains the line of sites to be computed) to the slave processors. Each data packet includes an array segment of cells, the address of the segment's location in the send array, and information about the end elements of the neighboring segments.


The slave processors each operate two main processes, both running in parallel. One is a worker process, where the actual computation takes place; it runs at low priority. The other is a work_packet_schedular, as shown in Fig. 4.

Fig. 4 Work packet scheduler on slave processors

The work_packet_schedular on each slave consists of two processes. The first is a scheduler process, which inputs data packets from the master and schedules tasks through buffers, either to the worker process or to the next processor in the chain of slaves, on a first come, first served basis. The buffers operate as request buffers: as soon as a buffer has passed on its task, it requests more work from the scheduler process. If requests for work from the worker process and from the next processor arrive at the same time, priority is given to the worker process. The second is a data_passer process, which inputs result data through buffers from both the worker process and the previous processor on a first come, first served basis, and forwards it to the next processor on the path towards the master processor.

In order to keep the slave processors busy, the task scheduler buffers an extra item of work, so that when the worker process completes the computation for an array segment it can start on the next one at once rather than having to wait for the master processor to send the next item of work.

The worker process inputs the array segment together with the information on the end bits of the neighboring segments and the address bits. It then updates the segment according to the C.A. rule described earlier, stores the result in another array, adds the address bits, and communicates the result to the data_passer process. The processed array segments, together with the address bits, are received by the other main process of the master processor, receive, and are placed in the data receive array at the appropriate positions. This completes the first iteration.
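The first iteration described above (packets farmed out from the send array, updated by a worker, and placed into the receive array by address) can be sketched as follows. This is a minimal single-process Python illustration, not the authors' implementation: the names `WorkPacket`, `make_packets`, `worker` and `place` are hypothetical, the scheduler and data_passer buffering is omitted, and the array length is assumed to be a multiple of the packet size.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorkPacket:
    address: int        # start index of the segment in the send array
    cells: List[int]    # the array segment itself
    left_end: int       # last cell of the left-hand neighboring segment
    right_end: int      # first cell of the right-hand neighboring segment


def make_packets(send_array: List[int], packet_size: int) -> List[WorkPacket]:
    """Master, send side: cut the line of sites into self-contained packets."""
    n = len(send_array)
    # send_array[a - 1] wraps to the last cell for a == 0 (periodic boundary).
    return [WorkPacket(a, send_array[a:a + packet_size],
                       send_array[a - 1], send_array[(a + packet_size) % n])
            for a in range(0, n, packet_size)]


def worker(packet: WorkPacket) -> Tuple[int, List[int]]:
    """Slave: update one segment with the sum-modulo-two rule."""
    padded = [packet.left_end] + packet.cells + [packet.right_end]
    result = [(padded[i - 1] + padded[i] + padded[i + 1]) % 2
              for i in range(1, len(padded) - 1)]
    # The address travels with the result so the master can place it.
    return packet.address, result


def place(receive_array: List[int], address: int, result: List[int]) -> None:
    """Master, receive side: store a returned segment at its address."""
    receive_array[address:address + len(result)] = result


if __name__ == "__main__":
    send = [0] * 768
    send[384] = 1
    receive = [0] * len(send)
    # Packets may come back in any order; reversed order is used here.
    for packet in reversed(make_packets(send, 24)):
        place(receive, *worker(packet))
    print(sum(receive), "cells set after the first iteration")
```

Because every packet carries its own address and the two neighboring end elements, a worker needs no communication with other workers, which is what allows the packets to be processed in any order by any slave.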
For subsequent iterations, an array segment can only be sent for processing if its adjoining neighbors are present, because the packet must carry the end-element information of the neighboring segments. Therefore, as soon as the master processor receives 3 contiguous segments in the data receive array, it copies the middle segment to the data send array. When 3 contiguous segments have been copied to the data send array, the middle segment from this array is sent to the slaves for further processing.

Experiments are performed on the dynamically allocated scheme by varying the network size, the computational load, and the size of the work packets in order to obtain optimum performance parameters. Timings from 1 to 7 slave processors are obtained for 2 iterations. Experiments are performed with packet sizes of 12, 24, and 48 cells for the total array length of 768 cells. Additional compute loads in the form of 2, 4, 6, 8 and multiplies are used. Table III shows the computation timings in seconds for a segment length of 24 cells in the dynamic scheme. The results of dynamic allocation show reasonable improvements in timings for the three packet sizes; the exception is the compute load of 2 multiplies, which shows only small improvements in performance for smaller networks.

TABLE III TIMING FOR 2 ITERATIONS IN DYNAMIC SCHEME FOR 24 CELLS

The speedup for the packet size of 24 cells shows very good results for all the additional compute loads except for the case of 2 multiplies, as shown in Fig. 5. A near-linear speedup is obtained when four slave processors are used. For the load of 6 multiplies, a speedup of 5.76 is achieved when all the slaves are used.

The results for the three segment sizes of 12, 24 and 48 cells are compared with artificially increased compute loads in terms of speedup. For comparison, compute loads of 2 and multiplies are chosen. Fig. 6 shows the speedup for the case of 2 multiplies. The segment size of 12 cells shows no improvement in the result.
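The dependency rule for subsequent iterations stated earlier in this section (a segment can be dispatched again only when both adjoining segments are present) can be illustrated with a small simulation. The Python sketch below is a simplification under stated assumptions, not the authors' implementation: the name `dynamic_run` is hypothetical, a per-segment history list stands in for the paper's data send and data receive arrays, and a seeded random choice stands in for the unpredictable order in which slaves request work.

```python
import random


def update(left_end, cells, right_end):
    """Apply the sum-modulo-two rule to one segment."""
    padded = [left_end] + cells + [right_end]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) % 2
            for i in range(1, len(padded) - 1)]


def dynamic_run(cells, packet_size, iterations, seed=0):
    """Process segments in arbitrary order, subject to the neighbor rule."""
    rng = random.Random(seed)
    n_seg = len(cells) // packet_size
    assert n_seg * packet_size == len(cells), "length must divide evenly"
    # history[k][t] holds segment k after t completed iterations.
    history = [[cells[k * packet_size:(k + 1) * packet_size]]
               for k in range(n_seg)]

    def ready(k):
        t = len(history[k]) - 1  # iterations completed by segment k
        left, right = history[k - 1], history[(k + 1) % n_seg]
        # Dispatch only if both neighbors have also completed iteration t.
        return t < iterations and len(left) > t and len(right) > t

    while any(len(h) <= iterations for h in history):
        # The least-advanced segment is always ready, so this list is
        # never empty while work remains.
        k = rng.choice([j for j in range(n_seg) if ready(j)])
        t = len(history[k]) - 1
        left_end = history[k - 1][t][-1]
        right_end = history[(k + 1) % n_seg][t][0]
        history[k].append(update(left_end, history[k][t], right_end))
    return [c for h in history for c in h[-1]]


if __name__ == "__main__":
    line = [0] * 768
    line[384] = 1
    print(sum(dynamic_run(line, 24, 5)), "cells set after 5 iterations")
```

Whatever order the segments are processed in, the final configuration matches a fully synchronous update, because each segment always reads its neighbors' end elements from the same time step as its own data. Segments in different parts of the line may therefore be one iteration apart at any moment, which is the slack the farm exploits to keep slaves busy.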
The reason is that, with segments of 12 cells, the master processor distributes 64 array segments for each line of 768 cells. Therefore, the master communicates a total of 28 array segments to do 2 iterations. With the compute load of 2 multiplies for each cell, the system does not balance the computation and communication loads. The results show that the system takes much more time to communicate data packets of this size to and from the slave processors than to compute them, and it therefore performs poorly. Increasing the size of the data packets for the additional load of 2 multiplies has only a small effect on the performance. The segment size of 48 cells shows slight improvements for up to 3 slave processors.

Fig. 5 Speedup for 24 cells in dynamic scheme

Fig. 6 Comparison of speedup results for the load of 2 multiplies

Fig. 7 shows the speedup for the case of multiplies. Excellent results are obtained for all the array segment sizes when from 1 to 4 slave processors are used. Again, the segment size of 24 cells gives the best performance when all the available slave processors are used. Therefore, when comparing the results for all the additional compute loads, the array segment of size 24 with the compute load of multiplies gives the best performance parameters in the dynamic scheduling scheme.

Fig. 7 Comparison of speedup for the load of multiplies

Fig. 8 shows the timing comparison of the two schemes for seven processors. Except for the compute load of 2 multiplies, the dynamic scheme performs better for all loads.

Fig. 8 Comparison of timings between the two schemes

V. CONCLUSION

In this paper we have considered a modified C.A. model with an artificially increased load. The recursive structure and spatial data dependency of this algorithm are representative of an important class of algorithms in science and engineering. The paper investigates the performance of scheduling techniques for the implementation of this type of algorithm on multicomputer networks. Experiments performed on implementations of the above techniques suggest that, over certain ranges of compute load, dynamic scheduling can outperform static scheduling in terms of speedup.

REFERENCES

[1] T. L. Casavant and J. G. Kuhl, "A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems," IEEE Trans. on Software Engineering, vol. 4, no. 2, Feb
[2] M. V. Avolio, A. Errara, V. Lupiano, P. Mazzanti, and S. D. Gregorio, "Development and Calibration of a Preliminary Cellular Automata Model for Snow Avalanches," in Proc. 9th Int. Conf. on Cellular Automata for Research and Industry, Ascoli Piceno, Italy, 2, pp
[3] D. Cacciagrano, F. Corradini, and E. Merelli, "Bone Remodelling: A Complex Automata-Based Model Running in BIO SHAPE," in Proc. 9th Int. Conf. on Cellular Automata for Research and Industry, Ascoli Piceno, Italy, 2, pp
[4] M. Ghaemi, O. Naderi, and Z. Zabihinpour, "A Novel Method for Simulating Cancer Growth," in Proc. 9th Int. Conf. on Cellular Automata for Research and Industry, Ascoli Piceno, Italy, 2, pp
[5] Y. Zhao, S. A. Billing, and A. F. Routh, "Identification of Excitable Media Using Cellular Automata Models," Int. J. of Bifurcation and Chaos, vol. 7, pp , 27.
[6] A. Ilachinski, Cellular Automata: A Discrete Universe. Singapore: World Scientific Publishing, 2.
[7] D. J. Pritchard, "Transputer Applications on Supernode," in Proc. Int. Conf. on Application of Transputers, Liverpool, U.K., Aug
[8] M. S. Laghari and F. Deravi, "Scheduling Techniques for the Parallel Implementation of the Hough Transform," in Proc. Engineering System Design and Analysis, Istanbul, Turkey, 992, pp
[9] A. S. Wagner, H. V. Sreekantaswamy, and S. T. Chanson, "Performance Models for the Processor Farm Paradigm," IEEE Trans. on Parallel and Distributed Systems, vol. 8, no. 5, pp , May 997.
[10] A. Walsch, "Architecture and Prototype of a Real-Time Processor Farm Running at MHz," Ph.D. Thesis, University of Mannheim, Mannheim, Germany 22.
[11] Y. S. Yang, J. H. Bahn, S. E. Lee, and N. Bagherzadeh, "Parallel and Pipeline Processing for Block Cipher Algorithms on a Network-on-Chip," in Proc. 6th Int. Conf. on Information Technology: New Generations, Las Vegas, Nevada, Apr. 29, pp
