The Computer Journal, 46(6), © British Computer Society 2003; all rights reserved

Speed-up of Parallel Processing of Divisible Loads on k-dimensional Meshes and Tori

KEQIN LI
Department of Computer Science, State University of New York, New Paltz, NY 12561, USA
Email: lik@newpaltz.edu

A divisible load distribution algorithm on k-dimensional meshes and tori is proposed and analyzed. It is found that by using our algorithm, the speed-up of parallel processing of a divisible load on k-dimensional meshes and tori is bounded from above by a quantity independent of network size, due to communication overhead and limited network connectivity. In particular, it is shown that for k-dimensional meshes and tori, as the network size becomes large, the asymptotic speed-up of processing divisible loads with corner initial processors is approximately $\beta^{1-1/2^k}$, where $\beta$ is the ratio of the time for computing a unit load to the time for communicating a unit load. It is also proved that by choosing interior initial processors, an asymptotic speed-up of $2^k\beta^{1-1/2^k}$ can be achieved.

Received 29 October 2002; revised 13 February 2003

1. INTRODUCTION

A divisible load has the property that it can be arbitrarily divided into small load fractions which are assigned to parallel processors for processing. Example applications include large-scale data file processing, signal and image processing, scientific and numerical computing, finite-element engineering computations, database and multimedia applications, and many real-time computing problems such as target identification, searching, and data processing in distributed sensor networks [1].

The problem of divisible load distribution on parallel and distributed computing systems with static interconnection networks was first proposed by Cheng and Robertazzi in 1988 [2]. Since then, divisible load distribution, scheduling, and processing have been investigated by a number of researchers for the bus [3, 4], linear array [5], tree [3, 6, 7], two-dimensional mesh [8], two-dimensional toroidal mesh [9],
three-dimensional mesh [10], hypercube [11], and partitionable [12, 13] networks. Other studies can be found in [14, 15, 16, 17]. (The reader is also referred to the Website http://www.ece.sunysb.edu/~tom/dlt.html for more references in this field.)

The well-known Amdahl's Law [18] states that if a fraction $f$ of a computation is sequential and cannot be parallelized at all, the speed-up is bounded from above by $1/f$ no matter how many processors are used. For a divisible load, there is no inherently sequential part, that is, $f = 0$. However, this does not imply that unbounded speed-up can be achieved. The reason is that Amdahl's Law places no restriction on a parallel system, where processors can communicate with each other without cost. When a divisible load is processed on a multicomputer with a static interconnection network, there is communication overhead for distributing the load among the processors. Also, the network topology, which determines the speed at which a divisible load is distributed over a network, has a strong impact on performance (i.e. parallel processing time and speed-up).

In this paper, we propose a divisible load distribution algorithm on k-dimensional meshes and tori, and analyze the parallel time and speed-up of the algorithm. We derive a recurrence relation so that the ultimate parallel processing time and asymptotic speed-up can be easily calculated for k-dimensional meshes and tori. It is found that by using our algorithm, the speed-up of parallel processing of a divisible load on k-dimensional meshes and tori is bounded from above by a quantity independent of network size, due to communication overhead and limited network connectivity. In particular, it is shown that for k-dimensional meshes and tori, as the network size becomes large, the asymptotic speed-up of processing divisible loads with corner initial processors is approximately $\beta^{1-1/2^k}$, where $\beta$ is the ratio of the time for computing a unit load to the time for communicating a unit load. We also prove that by choosing interior initial processors, an
asymptotic speed-up of $2^k\beta^{1-1/2^k}$ can be achieved [19].

The significance of our research is three-fold. First, these results include the earlier results for linear arrays and two-dimensional meshes in [13] as special cases. Second, this paper provides a unified treatment of divisible load distribution on k-dimensional meshes and tori for all $k \ge 1$. Third, divisible load distribution on k-dimensional meshes and tori with $k > 3$ has never been addressed before; our work gives an initial investigation of these networks.
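To make the contrast with Amdahl's Law concrete, the two bounds discussed above can be evaluated with a short script. This is our own minimal sketch (the function names are illustrative, not part of the paper), with $\beta$ as defined in Section 2:

```python
def amdahl_bound(f):
    """Amdahl's Law: if a fraction f of the work is inherently sequential,
    speed-up is at most 1/f, regardless of the number of processors."""
    return 1.0 / f

def mesh_bound(beta, k):
    """Asymptotic speed-up of a divisible load on a k-dimensional mesh with
    a corner initial processor: approximately beta**(1 - 1/2**k)."""
    return beta ** (1.0 - 1.0 / 2 ** k)

# A divisible load has f = 0, so Amdahl's Law alone would permit unbounded
# speed-up; communication overhead still caps it at a value independent of
# the network size, e.g. beta = 100 on a 2-dimensional mesh:
print(mesh_bound(100.0, 2))   # beta**(3/4), about 31.62
```

Note that the bound grows with the dimension $k$ (better connectivity) but never exceeds $\beta$ itself.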
2. THE MODEL

We consider parallel processing of a divisible load on a multicomputer system with $N$ processors $P_1, P_2, \ldots, P_N$ connected by a static interconnection network. Each processor $P_i$ has $n_i$ neighbors. It is assumed that $P_i$ has $n_i$ separate ports for communication with each of its $n_i$ neighbors; i.e. processor $P_i$ can send messages to all its $n_i$ neighbors simultaneously. Once a processor sends a load fraction to a neighbor, it can proceed with other computation and communication activities. This provides the capability to overlap computation with communication and enhances the system performance. However, a neighbor (receiver) must wait until a load fraction arrives before it starts to process the load fraction. It is this waiting time that limits the overall system performance.

Let $T_{\rm cm}$ be the time to transmit a unit load along a link. The time to send a load to a neighbor is proportional to the size of the load, with a negligible communication startup time. Let $T_{\rm cp}$ be the time to process a unit load on a processor. Again, the computation time is proportional to the size of a load. We use $\beta = T_{\rm cp}/T_{\rm cm}$ to denote the computation granularity, which is a parameter indicating the nature of a parallel computation and a parallel architecture. A large (small) $\beta$ gives a small (large) communication overhead. A computation-intensive load has a large $\beta$ and a communication-intensive load has a small $\beta$. An infinite $\beta$ implies that the communication cost is negligible.

3. THE LOAD DISTRIBUTION ALGORITHM

A k-dimensional mesh $M_k$ can be specified by $k \ge 1$ positive integers $N_1, N_2, \ldots, N_k$, where $N_r \ge 2$ is the size in the $r$th dimension of the mesh, $1 \le r \le k$. $M_k$ has a set of $N = N_1 N_2 \cdots N_k$ processors, $\mathcal{P}_k = \{P_{j_1,j_2,\ldots,j_k} \mid 1 \le j_r \le N_r,\ 1 \le r \le k\}$. Each processor $P_{j_1,j_2,\ldots,j_k}$ has neighbors $P_{j_1,\ldots,j_r\pm 1,\ldots,j_k}$ if they exist. Assume that there is a load $x$ initially on processor $P_{N_1,N_2,\ldots,N_k}$, called the initial processor. The load is to be distributed over all the $N$ processors of a k-dimensional mesh for parallel processing. We now describe our
algorithm $A_{N_1,N_2,\ldots,N_k}$ for processing divisible loads on a k-dimensional mesh $M_k$ of size $N = N_1 N_2 \cdots N_k$. For notational convenience, a single processor is treated as a zero-dimensional mesh $M_0$ with a one-processor set $\mathcal{P}_0$. Our algorithm $A_{N_1,N_2,\ldots,N_k}$ for processing a divisible load $x$ on a k-dimensional mesh $M_k$ works as follows.

(A1) When $N = 1$, the single processor processes the load $x$ by itself.

(A2) In general, when $N > 1$, the initial processor $P_{N_1,N_2,\ldots,N_k}$ sends a fraction $\alpha$ of the load $x$ to processor $P_{N_1,N_2,\ldots,N_k-1}$. The remaining load $(1-\alpha)x$ is processed by the (k-1)-dimensional mesh $M_{k-1}$ of size $N_1 N_2 \cdots N_{k-1}$, with the set of processors $\mathcal{P}_{k-1} = \{P_{j_1,j_2,\ldots,j_{k-1},N_k} \mid 1 \le j_r \le N_r,\ 1 \le r \le k-1\}$, by using the load distribution algorithm $A_{N_1,N_2,\ldots,N_{k-1}}$.

(A3) If $N_k > 2$, processor $P_{N_1,N_2,\ldots,N_k-1}$ is regarded as the initial processor of the k-dimensional mesh $M'_k$ of size $N_1 N_2 \cdots (N_k-1)$, with the set of processors $\mathcal{P}'_k = \{P_{j_1,j_2,\ldots,j_k} \mid 1 \le j_r \le N_r,\ 1 \le r \le k-1,\ 1 \le j_k \le N_k-1\}$. Upon the arrival of the load $\alpha x$ at processor $P_{N_1,N_2,\ldots,N_k-1}$, the k-dimensional mesh $M'_k$ processes the load $\alpha x$ by using the load distribution algorithm $A_{N_1,N_2,\ldots,N_k-1}$.

(A4) If $N_k = 2$, processor $P_{N_1,N_2,\ldots,N_k-1}$ is regarded as the initial processor of the (k-1)-dimensional mesh $M'_{k-1}$ of size $N_1 N_2 \cdots N_{k-1}$, with the set of processors $\mathcal{P}'_{k-1} = \{P_{j_1,j_2,\ldots,j_{k-1},N_k-1} \mid 1 \le j_r \le N_r,\ 1 \le r \le k-1\}$. Upon the arrival of the load $\alpha x$ at processor $P_{N_1,N_2,\ldots,N_k-1}$, the (k-1)-dimensional mesh $M'_{k-1}$ processes the load $\alpha x$ by using the load distribution algorithm $A_{N_1,N_2,\ldots,N_{k-1}}$.

A k-dimensional torus is similar to a k-dimensional mesh except that each processor $P_{j_1,j_2,\ldots,j_k}$ has neighbors $P_{j_1,\ldots,(j_r\pm 1) \bmod N_r,\ldots,j_k}$. Since a k-dimensional torus contains a k-dimensional mesh as its subnetwork, algorithm $A_{N_1,N_2,\ldots,N_k}$ is also applicable for load distribution on k-dimensional tori.

4. PARALLEL TIME AND SPEED-UP

Let $T_{N_1,N_2,\ldots,N_k}$ denote the parallel time for processing one unit of load on a k-dimensional mesh $M_k$ of size $N = N_1 N_2 \cdots N_k$ by using the load distribution algorithm $A_{N_1,N_2,\ldots,N_k}$. Since both computation and communication times are linearly proportional to the amount of load, the time for processing $x$ units of load on a k-dimensional mesh $M_k$ of size $N = N_1 N_2 \cdots N_k$ is $x T_{N_1,N_2,\ldots,N_k}$ for all $x \ge 0$. The speed-up $S_{N_1,N_2,\ldots,N_k}$ is defined as the ratio of the sequential processing time to the parallel processing time, namely
$$S_{N_1,N_2,\ldots,N_k} = \frac{T_1}{T_{N_1,N_2,\ldots,N_k}}.$$
We are particularly interested in
$$T_{\infty,\ldots,\infty} = \lim_{N_1,N_2,\ldots,N_k \to \infty} T_{N_1,N_2,\ldots,N_k} \quad \text{and} \quad S_{\infty,\ldots,\infty} = \lim_{N_1,N_2,\ldots,N_k \to \infty} S_{N_1,N_2,\ldots,N_k},$$
i.e. the ultimate parallel processing time and the asymptotic speed-up as the size of the k-dimensional mesh goes to infinity.
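Steps (A1)-(A4) above recursively peel a (k-1)-dimensional face off the mesh. As a sanity check that this recursion assigns a load fraction to every processor exactly once, it can be mirrored in a short sketch (our own illustration, not part of the algorithm's specification):

```python
from math import prod

def covered(dims):
    """Number of processors reached by the recursive distribution (A1)-(A4);
    dims = (N1, ..., Nk), and the empty tuple is the 0-dimensional mesh M0."""
    if not dims:
        return 1                                      # (A1): a single processor
    *rest, nk = dims
    face = covered(tuple(rest))                       # (A2): the face j_k = N_k
    if nk > 2:
        remainder = covered(tuple(rest) + (nk - 1,))  # (A3): mesh of size N_k - 1
    else:
        remainder = covered(tuple(rest))              # (A4): remaining face j_k = 1
    return face + remainder

# Every processor of the mesh receives a load fraction:
for dims in [(5,), (3, 4), (2, 3, 4)]:
    assert covered(dims) == prod(dims)
```

The check confirms that the two submeshes produced at each step partition the processor set, which is what allows the timing analysis below to equate their finishing times.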
We use $T_\infty$ and $S_\infty$ to represent $T_{\infty,\ldots,\infty}$ and $S_{\infty,\ldots,\infty}$, respectively, where there are $k$ $\infty$'s. We now prove the main result of the paper.

THEOREM 4.1. For k-dimensional meshes, we have
$$T_\infty = \frac{1}{2}\left(\sqrt{4T_{\rm cp}T_{\rm cm} + T_{\rm cm}^2} - T_{\rm cm}\right), \quad k = 1,$$
$$T_\infty = \frac{1}{2}\left(\sqrt{4T_\infty^{(k-1)}T_{\rm cm} + T_{\rm cm}^2} - T_{\rm cm}\right), \quad k \ge 2,$$
where $T_\infty^{(k-1)}$ denotes the corresponding quantity for (k-1)-dimensional meshes. Furthermore, we have $S_\infty \approx \beta^{1-1/2^k}$ for large $\beta$ and all $k \ge 1$.

Proof. The value $T_{N_1,N_2,\ldots,N_k}$ can be obtained recursively as follows. First, by (A1), we have $T_1 = T_{\rm cp}$. In general, when $N > 1$, processor $P_{N_1,N_2,\ldots,N_k}$ can proceed without waiting after sending the load fraction $\alpha$ in (A2). Hence, the time spent by $M_{k-1}$ is $(1-\alpha)T_{N_1,N_2,\ldots,N_{k-1}}$. In (A3), it takes $\alpha T_{\rm cm}$ time for the load fraction $\alpha$ to reach processor $P_{N_1,N_2,\ldots,N_k-1}$. Then $M'_k$ requires $\alpha T_{N_1,N_2,\ldots,N_k-1}$ time to process the load fraction $\alpha$, where $T_{N_1,N_2,\ldots,N_k-1}$ is the parallel time on the mesh of size $N_1 N_2 \cdots (N_k-1)$. To minimize the parallel processing time $T_{N_1,N_2,\ldots,N_k}$, we need to make sure that both $M_{k-1}$ and $M'_k$ spend the same amount of time, i.e.
$$T_{N_1,N_2,\ldots,N_k} = (1-\alpha)T_{N_1,N_2,\ldots,N_{k-1}} = \alpha(T_{\rm cm} + T_{N_1,N_2,\ldots,N_k-1}).$$
This implies that
$$\alpha = \frac{T_{N_1,N_2,\ldots,N_{k-1}}}{T_{N_1,N_2,\ldots,N_k-1} + T_{N_1,N_2,\ldots,N_{k-1}} + T_{\rm cm}}.$$
Hence, $T_{N_1,N_2,\ldots,N_k}$ satisfies the following recurrence relation:
$$T_1 = T_{\rm cp}, \tag{1}$$
$$T_{N_1,N_2,\ldots,N_k} = \frac{T_{N_1,N_2,\ldots,N_{k-1}}\left(T_{N_1,N_2,\ldots,N_k-1} + T_{\rm cm}\right)}{T_{N_1,N_2,\ldots,N_k-1} + T_{N_1,N_2,\ldots,N_{k-1}} + T_{\rm cm}}, \quad N > 1. \tag{2}$$
Taking the limit on both sides of Equation (2), we obtain
$$T_\infty = \frac{T_{\rm cp}(T_\infty + T_{\rm cm})}{T_\infty + T_{\rm cp} + T_{\rm cm}}, \quad k = 1,$$
$$T_\infty = \frac{T_\infty^{(k-1)}(T_\infty + T_{\rm cm})}{T_\infty + T_\infty^{(k-1)} + T_{\rm cm}}, \quad k \ge 2.$$
That is,
$$T_\infty^2 + T_{\rm cm}T_\infty - T_{\rm cp}T_{\rm cm} = 0, \quad k = 1,$$
$$T_\infty^2 + T_{\rm cm}T_\infty - T_\infty^{(k-1)}T_{\rm cm} = 0, \quad k \ge 2.$$
Solving these quadratic equations, we get $T_\infty$ as given in the theorem.

As for the asymptotic speed-up, we prove by induction on $k \ge 1$ that $S_\infty \approx \beta^{1-1/2^k}$ for large $\beta$. When $k = 1$, we note that
$$S_\infty = \frac{T_{\rm cp}}{T_\infty} = \frac{2T_{\rm cp}}{\sqrt{4T_{\rm cp}T_{\rm cm} + T_{\rm cm}^2} - T_{\rm cm}} = \frac{2\beta}{\sqrt{4\beta + 1} - 1} \approx \beta^{1/2}$$
for large $\beta$. When $k \ge 2$, we have
$$S_\infty = \frac{T_{\rm cp}}{T_\infty} = \frac{2T_{\rm cp}}{\sqrt{4T_\infty^{(k-1)}T_{\rm cm} + T_{\rm cm}^2} - T_{\rm cm}} = \frac{2\beta}{\sqrt{4(T_\infty^{(k-1)}/T_{\rm cm}) + 1} - 1}.$$
By the induction hypothesis, i.e.
$$S_\infty^{(k-1)} = \frac{T_{\rm cp}}{T_\infty^{(k-1)}} = \frac{\beta}{T_\infty^{(k-1)}/T_{\rm cm}} \approx \beta^{1-1/2^{k-1}},$$
we know that $T_\infty^{(k-1)}/T_{\rm cm} \approx \beta^{1/2^{k-1}}$ for large $\beta$. Therefore,
$$S_\infty = \frac{2\beta}{\sqrt{4\beta^{1/2^{k-1}} + 1} - 1} \approx \beta^{1-1/2^k} \tag{3}$$
for large $\beta$. This proves the theorem.

The following corollaries are immediate consequences of Theorem 4.1 for two- and three-dimensional meshes.

COROLLARY 4.1. For two-dimensional meshes, we have
$$T_{\infty,\infty} = \frac{1}{2}\left(\sqrt{4T_\infty T_{\rm cm} + T_{\rm cm}^2} - T_{\rm cm}\right).$$
Furthermore,
$$S_{\infty,\infty} = \frac{2\beta}{\sqrt{2\sqrt{4\beta + 1} - 1} - 1} \approx \beta^{3/4}$$
for large $\beta$.

COROLLARY 4.2. For three-dimensional meshes, we have
$$T_{\infty,\infty,\infty} = \frac{1}{2}\left(\sqrt{4T_{\infty,\infty}T_{\rm cm} + T_{\rm cm}^2} - T_{\rm cm}\right).$$
Furthermore,
$$S_{\infty,\infty,\infty} = \frac{2\beta}{\sqrt{2\sqrt{2\sqrt{4\beta + 1} - 1} - 1} - 1} \approx \beta^{7/8}$$
for large $\beta$.

It is clear that Theorem 4.1 also holds for k-dimensional tori.

COROLLARY 4.3. For k-dimensional tori, we have
$$T_\infty = \frac{1}{2}\left(\sqrt{4T_{\rm cp}T_{\rm cm} + T_{\rm cm}^2} - T_{\rm cm}\right), \quad k = 1,$$
$$T_\infty = \frac{1}{2}\left(\sqrt{4T_\infty^{(k-1)}T_{\rm cm} + T_{\rm cm}^2} - T_{\rm cm}\right), \quad k \ge 2.$$
Furthermore, we have $S_\infty \approx \beta^{1-1/2^k}$ for large $\beta$ and all $k \ge 1$.

In Table 1, we demonstrate numerical values of the asymptotic speed-up $S_\infty$. For each pair of $\beta$ and $k$, we give three values of $S_\infty$. The first value is the exact value of $S_\infty$ calculated by using Theorem 4.1. The second and third values are the estimations of $S_\infty$ in Equation (3), i.e. $2\beta/(\sqrt{4\beta^{1/2^{k-1}} + 1} - 1)$ and $\beta^{1-1/2^k}$. These estimations are more accurate for small $k$ and large $\beta$ than for large $k$ and small $\beta$.

5. PERFORMANCE IMPROVEMENT

Improved speed-up can be achieved by placing the initial load on an interior processor instead of corner or boundary processors. A submesh $M'_k$ of a k-dimensional mesh $M_k$ of size $N = N_1 N_2 \cdots N_k$ contains processors $\mathcal{P}'_k = \{P_{j_1,j_2,\ldots,j_k} \mid a_r \le j_r \le b_r,\ 1 \le r \le k\}$, where $a_r < b_r$ for all $1 \le r \le k$, and $b_r - a_r + 1$ is the size in the $r$th dimension of the submesh. A processor $P_{j_1,j_2,\ldots,j_k}$ is called a boundary processor of $M'_k$ in dimension $r$ if $j_r = a_r$ or $j_r = b_r$, and a corner processor of $M'_k$ if $j_r = a_r$ or $j_r = b_r$ for all $1 \le r \le k$. A processor $P_{j_1,j_2,\ldots,j_k}$ is called an interior processor of $M'_k$ in dimension $r$ if $a_r < j_r < b_r$. We say that a k-dimensional mesh $M_k$ is split at $s_r$ in dimension $r$ if $M_k$ is divided into two disjoint submeshes: $M'_k$ containing processors $\mathcal{P}'_k = \{P_{j_1,j_2,\ldots,j_k} \mid 1 \le j_{r'} \le N_{r'},\ r' \ne r,\ 1 \le j_r \le s_r\}$, and $M''_k$ containing processors $\mathcal{P}''_k = \{P_{j_1,j_2,\ldots,j_k} \mid 1 \le j_{r'} \le N_{r'},\ r' \ne r,\ s_r + 1 \le j_r \le N_r\}$. If a processor $P_{j_1,j_2,\ldots,j_k}$ is an interior processor of a k-dimensional mesh $M_k$ in dimensions $r_1, r_2, \ldots, r_m$, it will eventually become a corner processor of a submesh by splitting $M_k$ $m$ times.

Let $T^{(m)}_{N_1,N_2,\ldots,N_k}$ denote the parallel time for processing one unit of load on a k-dimensional mesh $M_k$ of size $N = N_1 N_2 \cdots N_k$ when the initial processor is an interior processor in $m$ of the $k$ dimensions, where $0 \le m \le k$. ($T^{(0)}_{N_1,N_2,\ldots,N_k}$ is simply $T_{N_1,N_2,\ldots,N_k}$.) Define
$$T^{(m)}_{\infty,\ldots,\infty} = \lim_{N_1,N_2,\ldots,N_k \to \infty} T^{(m)}_{N_1,N_2,\ldots,N_k} \quad \text{and} \quad S^{(m)}_{\infty,\ldots,\infty} = \lim_{N_1,N_2,\ldots,N_k \to \infty} S^{(m)}_{N_1,N_2,\ldots,N_k},$$
with abbreviations $T^{(m)}_\infty$ and $S^{(m)}_\infty$.

THEOREM 5.1. For k-dimensional meshes and tori, we have
$$T^{(m)}_\infty \approx \frac{T_\infty}{2^m} \quad \text{and} \quad S^{(m)}_\infty \approx 2^m\beta^{1-1/2^k}$$
for all $0 \le m \le k$. When the initial processor is an interior processor in all the $k$ dimensions, an asymptotic speed-up of $2^k\beta^{1-1/2^k}$ can be achieved.

Proof. When $m = 1$, the initial processor $P_{N_1,\ldots,j_r,\ldots,N_k}$ is an interior processor in one dimension $r$. The initial processor sends a fraction $\alpha$ of the load $x$ to one of its neighbors in dimension $r$, say, $P_{N_1,\ldots,j_r+1,\ldots,N_k}$. The initial processor and the selected neighbor are initial processors of two separate k-dimensional meshes $M'_k$ and $M''_k$ obtained by splitting $M_k$ at $j_r$ in the $r$th dimension. These two submeshes process the loads $(1-\alpha)x$ and $\alpha x$, respectively, by using algorithm $A$. It is clear that
$$T^{(1)}_\infty = (1-\alpha)T_\infty = \alpha(T_\infty + T_{\rm cm}),$$
which yields
$$T^{(1)}_\infty = \frac{T_\infty(T_\infty + T_{\rm cm})}{2T_\infty + T_{\rm cm}} = T_\infty \cdot \frac{\beta^{1/2^k} + 1}{2\beta^{1/2^k} + 1} \approx \frac{T_\infty}{2}$$
and $S^{(1)}_\infty = T_{\rm cp}/T^{(1)}_\infty \approx 2\beta^{1-1/2^k}$ for large $\beta$.

When $m > 1$, the initial processor is an interior processor in dimensions $r_1, r_2, \ldots, r_m$. The k-dimensional mesh $M_k$ is first split at $j_{r_1}$ in dimension $r_1$. The two resulting submeshes are further split in dimension $r_2$, the four resulting submeshes are further split in dimension $r_3$, and so on. It is not difficult to see that
$$T^{(m)}_\infty = (1-\alpha)T^{(m-1)}_\infty = \alpha(T^{(m-1)}_\infty + T_{\rm cm}),$$
which gives
$$T^{(m)}_\infty = \frac{T^{(m-1)}_\infty(T^{(m-1)}_\infty + T_{\rm cm})}{2T^{(m-1)}_\infty + T_{\rm cm}} \approx \frac{T^{(m-1)}_\infty}{2}$$
for all $1 \le m \le k$. Hence, $T^{(m)}_\infty \approx T_\infty/2^m$ and $S^{(m)}_\infty = T_{\rm cp}/T^{(m)}_\infty \approx 2^m\beta^{1-1/2^k}$.

In Table 2, we demonstrate numerical values of the asymptotic speed-up $S^{(m)}_\infty$ for $k = 3$. It can be seen that the doubling effect $S^{(m)}_\infty \approx 2S^{(m-1)}_\infty$ is stronger for small $m$ and large $\beta$ than for large $m$ and small $\beta$.
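The closed forms of Theorems 4.1 and 5.1, and the finite recurrence (2), can be checked numerically. The sketch below is our own (function names are illustrative); it normalizes $T_{\rm cm} = 1$, so $T_{\rm cp} = \beta$, and reproduces sample entries of Tables 1 and 2:

```python
from math import sqrt

def t_finite(dims, beta, memo=None):
    """Parallel time T_{N1,...,Nk} from recurrence (2), with T_cm = 1, T_cp = beta."""
    if memo is None:
        memo = {}
    dims = tuple(d for d in dims if d > 1)   # a dimension of size 1 is degenerate
    if not dims:
        return beta                          # (1): T_1 = T_cp
    if dims not in memo:
        t_face = t_finite(dims[:-1], beta, memo)                    # T_{N1,...,N_{k-1}}
        t_rest = t_finite(dims[:-1] + (dims[-1] - 1,), beta, memo)  # last dim shrunk
        memo[dims] = t_face * (t_rest + 1.0) / (t_rest + t_face + 1.0)
    return memo[dims]

def t_inf(k, beta):
    """Ultimate parallel time of Theorem 4.1: apply the quadratic's root k times."""
    t = beta                                 # start from T_cp
    for _ in range(k):
        t = (sqrt(4.0 * t + 1.0) - 1.0) / 2.0
    return t

def t_inf_interior(k, m, beta):
    """Theorem 5.1: initial processor interior in m of the k dimensions;
    each of the m splits roughly halves the ultimate parallel time."""
    t = t_inf(k, beta)
    for _ in range(m):
        t = t * (t + 1.0) / (2.0 * t + 1.0)
    return t

# Corner initial processor, beta = 100: S is about 38.103 (k = 2) and 83.652
# (k = 3), matching the exact values in Table 1; interior in all three
# dimensions of a 3-dimensional mesh: about 250.553, matching Table 2.
print(100.0 / t_inf(2, 100.0), 100.0 / t_inf_interior(3, 3, 100.0))
```

Finite meshes approach the limit from above: for example, $T_{8,8} > T_{\infty,\infty}$, and the gap shrinks as the mesh grows.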
TABLE 1. Numerical values of the asymptotic speed-up $S_\infty$. For each $\beta$, the first row is the exact value from Theorem 4.1; the second and third rows are the estimations $2\beta/(\sqrt{4\beta^{1/2^{k-1}} + 1} - 1)$ and $\beta^{1-1/2^k}$.

    β        k=1       k=2       k=3       k=4       k=5       k=6       k=7
    1        1.618     2.317     3.071     3.865     4.690     5.537     6.401
             1.618     1.618     1.618     1.618     1.618     1.618     1.618
             1.000     1.000     1.000     1.000     1.000     1.000     1.000
    2        2.000     3.236     4.633     6.142     7.731     9.379    11.073
             2.000     2.532     2.858     3.040     3.136     3.186     3.211
             1.414     1.682     1.834     1.915     1.957     1.978     1.989
    5        2.791     5.384     8.537    12.073    15.875    19.870    24.008
             2.791     4.644     6.089     7.007     7.526     7.802     7.945
             2.236     3.344     4.089     4.522     4.755     4.876     4.938
    10       3.702     8.210    14.053    20.806    28.188    36.015    44.169
             3.702     7.423    10.820    13.186    14.594    15.363    15.765
             3.162     5.623     7.499     8.660     9.306     9.647     9.822
    20       5.000    12.808    23.642    36.572    50.933    66.297    82.391
             5.000    11.954    19.272    24.831    28.304    30.253    31.286
             4.472     9.457    13.753    16.585    18.213    19.085    19.537
    50       7.589    23.640    48.175    78.759   113.465   151.029   190.640
             7.589    22.668    41.472    57.380    67.961    74.102    77.414
             7.071    18.803    30.662    39.155    44.246    47.035    48.495
    100     10.512    38.103    83.652   142.397   210.155   284.122   362.500
            10.512    37.016    74.226   108.204   131.860   145.935   153.630
            10.000    31.623    56.234    74.989    86.596    93.057    96.466
    200     14.651    61.950   146.515   259.456   391.874   537.647   692.847
            14.651    60.722   133.098   204.164   255.877   287.416   304.883
            14.142    53.183   103.134   143.620   169.482   184.110   191.890
    500     22.866   118.968   310.527   578.785   900.245  1258.041  1641.289
            22.866   117.507   288.800   472.992   614.813   704.124   754.439
            22.361   105.737   229.932   339.066   411.744   453.731   476.304
    1000    32.127   196.021   551.472  1067.886  1697.120  2403.287  3163.080
            32.127   194.341   519.882   893.603  1193.486  1386.886  1497.246
            31.623   177.828   421.697   649.382   805.842   897.687   947.464
    2000    45.224   324.207   983.500  1977.959  3210.240  4604.602  6111.473
            45.224   322.265   937.253  1689.128  2317.166  2731.806  2971.438
            44.721   299.070   773.395  1243.700  1577.149  1776.035  1884.693
    5000    71.212   633.377  2124.222  4489.820  7487.879 10917.256 14644.645
            71.212   631.009  2046.989  3922.265  5571.297  6693.664  7353.204
            70.711   594.604  1724.244  2936.192  3831.574  4376.970  4678.125
    10000  100.501  1054.012  3816.058  8373.410 14249.637 21026.595 28424.059
           100.501  1051.249  3701.562  7422.609 10820.441 13185.993 14593.534
           100.000  1000.000  3162.278  5623.413  7498.942  8659.643  9305.720

6. NOTES ON RELATED WORK

Performance limits to parallel processing of
divisible loads on static interconnection networks have been observed previously. Asymptotic performance analyses for linear arrays were conducted in [20]. The special cases of Theorems 4.1 and 5.1 where $k = 1$ for linear arrays are essentially similar to those in [21], and the special cases of Theorems 4.1 and 5.1 where $k = 2$ for two-dimensional meshes were obtained in [12]. An infinite two-dimensional mesh with the initial processor in the center was considered in [8]; however, our study deals with finite meshes. It was shown in [10] that a speed-up of $O(\beta)$ can be achieved in three-dimensional meshes. The result is
obtained by adopting the circuit-switched routing technique, which assumes that communication times are independent of the distances among processors.

Note that part (A4) of algorithm $A_{N_1,N_2,\ldots,N_k}$ is not related to the analysis in this paper. The reason is that, in this paper, we increase the network size $N$ by fixing $k$ and increasing the sizes of all the dimensions. It is also possible to increase $N$ by fixing $N_1, N_2, \ldots, N_k$ and increasing the number of dimensions $k$. For instance, when all the $N_r$'s are fixed at 2 and $N$ increases as $k$ increases, we get hypercubes. For hypercubes, we use the (k-1)-dimensional mesh $M_{k-1}$ in (A2) and the (k-1)-dimensional mesh $M'_{k-1}$ in (A4) to process the load fractions $(1-\alpha)x$ and $\alpha x$, respectively. Therefore, the analysis of parallel time and speed-up for processing divisible loads on hypercubes follows a different direction [13].

TABLE 2. Numerical values of the asymptotic speed-up $S^{(m)}_\infty$ ($k = 3$).

    β          m=0        m=1        m=2        m=3
    1          3.071      3.825      4.618      5.440
    2          4.633      6.030      7.532      9.112
    5          8.537     11.690     15.192     18.954
    10        14.053     19.895     26.550     33.814
    20        23.642     34.477     47.134     61.176
    50        48.175     72.710    102.337    135.926
    100       83.652    129.201    185.571    250.553
    200      146.515    231.080    338.290    463.980
    500      310.527    502.086    752.607   1053.023
    1000     551.472    906.922   1382.517   1962.793
    2000     983.500   1642.793   2544.734   3664.595
    5000    2124.222   3615.067   5713.174   8379.599
    10000   3816.058   6578.103  10546.049  15678.934

7. CONCLUDING REMARKS

We have proposed a divisible load distribution algorithm on k-dimensional meshes and tori and analyzed the parallel time and speed-up of the algorithm. We have shown that by using our algorithm on k-dimensional meshes and tori, as the network size becomes large, the asymptotic speed-up of processing divisible loads with corner initial processors is approximately $\beta^{1-1/2^k}$. We have also proved that by choosing interior initial processors, an asymptotic speed-up of $2^k\beta^{1-1/2^k}$ can be achieved. The improved speed-up for large $k$ is due to the increased network connectivity, which yields a faster speed of load distribution. Our work includes the earlier results for linear arrays
and two-dimensional meshes as special cases, provides a unified treatment of divisible load distribution on k-dimensional meshes and tori for all $k \ge 1$, and gives an initial investigation of divisible load distribution on k-dimensional meshes and tori with $k > 3$.

ACKNOWLEDGEMENTS

The author wishes to express his gratitude to two anonymous reviewers for their criticism and comments. This material is based upon work supported by the US National Science Foundation under Grant No. CCR-0091719. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] Bharadwaj, V., Ghose, D., Mani, V. and Robertazzi, T. G. (1996) Scheduling Divisible Loads in Parallel and Distributed Systems. IEEE Computer Society Press, Los Alamitos, CA.
[2] Cheng, Y. C. and Robertazzi, T. G. (1988) Distributed computation with communication delays. IEEE Trans. Aerospace Electron. Syst., 24, 700-712.
[3] Bataineh, S., Hsiung, T.-Y. and Robertazzi, T. G. (1994) Closed form solutions for bus and tree networks of processors load sharing a divisible job. IEEE Trans. Computers, 43, 1184-1196.
[4] Sohn, J. and Robertazzi, T. G. (1996) Optimal divisible job load sharing for bus networks. IEEE Trans. Aerospace Electron. Syst., 32, 34-40.
[5] Mani, V. and Ghose, D. (1994) Distributed computation in linear networks: closed-form solutions. IEEE Trans. Aerospace Electron. Syst., 30, 471-483.
[6] Barlas, G. D. (1998) Collection-aware optimum sequencing of operations and closed-form solutions for the distribution of a divisible load on arbitrary processor trees. IEEE Trans. Parallel Distributed Syst., 9, 429-441.
[7] Cheng, Y. C. and Robertazzi, T. G. (1990) Distributed computation for a tree network with communication delays. IEEE Trans. Aerospace Electron. Syst., 26, 511-516.
[8] Błażewicz, J. and Drozdowski, M. (1996) The performance limits of a two-dimensional network of load sharing processors. Found. Comput. Decision Sci., 21, 3-15.
[9] Błażewicz, J., Drozdowski, M., Guinard, F. and Trystram, D. (1999) Scheduling a
divisible task in a two-dimensional toroidal mesh. Discrete Appl. Math., 94, 35-50.
[10] Drozdowski, M. and Głazek, W. (1999) Scheduling divisible loads in a three-dimensional mesh of processors. Parallel Computing, 25, 381-404.
[11] Błażewicz, J. and Drozdowski, M. (1995) Scheduling divisible jobs on hypercubes. Parallel Computing, 21, 1945-1956.
[12] Li, K. (1998) Managing divisible load on partitionable networks. In Schaeffer, J. (ed.), High Performance Computing Systems and Applications, pp. 217-228. Kluwer Academic Publishers, Boston, MA.
[13] Li, K. (2003) Parallel processing of divisible loads on partitionable static interconnection networks. Cluster Computing (Special Issue on Divisible Load Scheduling), 6, 47-55.
[14] Błażewicz, J. and Drozdowski, M. (1997) Distributed processing of divisible jobs with communication startup costs. Discrete Appl. Math., 76, 21-41.
[15] Błażewicz, J., Drozdowski, M. and Markiewicz, M. (1999) Divisible task scheduling: concept and verification. Parallel Computing, 25, 87-98.
[16] Ko, K. (2000) Scheduling Data Intensive Parallel Processing in Distributed and Networked Environments. PhD dissertation, Department of Electrical and Computer Engineering, State University of New York, Stony Brook, New York.
[17] Sohn, J., Robertazzi, T. G. and Luryi, S. (1998) Optimizing computing costs using divisible load analysis. IEEE Trans. Parallel Distrib. Syst., 9, 225-234.
[18] Amdahl, G. M. (1967) Validity of the single processor approach to achieving large scale computing capabilities. In Proc. AFIPS Spring Joint Computer Conf., Vol. 30, pp. 483-485.
[19] Li, K. (2002) Speedup of parallel processing of divisible loads on k-dimensional meshes and tori. In Proc. Int. Conf. on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, June 24-27, pp. 171-177. CSREA Press.
[20] Ghose, D. and Mani, V. (1994) Distributed computation with communication delays: asymptotic performance analysis. J. Parallel Distrib. Comput., 23, 293-305.
[21] Bataineh, S. and Robertazzi, T. G. (1992) Ultimate performance limits for networks of load sharing processors. In Proc. Conf. on Information Sciences and Systems, pp. 794-799. Princeton University Press, Princeton, NJ.