Speed-up of Parallel Processing of Divisible Loads on k-dimensional Meshes and Tori


The Computer Journal, Vol. 46, No. 6, © British Computer Society 2003; all rights reserved

Speed-up of Parallel Processing of Divisible Loads on k-dimensional Meshes and Tori

KEQIN LI

Department of Computer Science, State University of New York, New Paltz, NY 12561, USA
Email: lik@newpaltz.edu

A divisible load distribution algorithm on k-dimensional meshes and tori is proposed and analyzed. It is found that, by using our algorithm, the speed-up of parallel processing of a divisible load on k-dimensional meshes and tori is bounded from above by a quantity independent of network size, due to communication overhead and limited network connectivity. In particular, it is shown that for k-dimensional meshes and tori, as the network size becomes large, the asymptotic speed-up of processing divisible loads with corner initial processors is approximately $\beta^{1-1/2^k}$, where $\beta$ is the ratio of the time for computing a unit load to the time for communicating a unit load. It is also proved that, by choosing interior initial processors, an asymptotic speed-up of $2^k\beta^{1-1/2^k}$ can be achieved.

Received 29 October 2002; revised 13 February 2003

1. INTRODUCTION

A divisible load has the property that it can be arbitrarily divided into small load fractions which are assigned to parallel processors for processing. Example applications include large-scale data file processing, signal and image processing, scientific and numerical computing, finite-element engineering computations, database and multimedia applications, and many real-time computing problems such as target identification, searching, and data processing in distributed sensor networks [1].

The problem of divisible load distribution on parallel and distributed computing systems with static interconnection networks was first proposed by Cheng and Robertazzi in 1988 [2]. Since then, divisible load distribution, scheduling and processing have been investigated by a number of researchers for the bus [3, 4], linear array [5], tree [3, 6, 7], two-dimensional mesh [8], two-dimensional toroidal mesh [9], three-dimensional mesh [10], hypercube [11] and partitionable [12, 13] networks. Other studies can be found in [14, 15, 16, 17].¹

¹ The reader is also referred to the Website http://www.ece.sunysb.edu/~tom/dlt.html for more references in this field.

The well-known Amdahl's Law [18] states that if a fraction f of a computation is sequential and cannot be parallelized at all, the speed-up is bounded from above by 1/f no matter how many processors are used. For a divisible load, there is no inherently sequential part, that is, f = 0. However, this does not imply that unbounded speed-up can be achieved. The reason is that Amdahl's Law places no restriction on a parallel system, where processors can communicate with each other without cost. When a divisible load is processed on a multicomputer with a static interconnection network, there is communication overhead for distributing the load among the processors. Also, the network topology, which determines the speed at which a divisible load is distributed over a network, has a strong impact on performance (i.e. parallel processing time and speed-up).

In this paper, we propose a divisible load distribution algorithm on k-dimensional meshes and tori, and analyze the parallel time and speed-up of the algorithm. We derive a recurrence relation from which the ultimate parallel processing time and asymptotic speed-up can be easily calculated for k-dimensional meshes and tori. It is found that, by using our algorithm, the speed-up of parallel processing of a divisible load on k-dimensional meshes and tori is bounded from above by a quantity independent of network size, due to communication overhead and limited network connectivity. In particular, it is shown that for k-dimensional meshes and tori, as the network size becomes large, the asymptotic speed-up of processing divisible loads with corner initial processors is approximately $\beta^{1-1/2^k}$, where $\beta$ is the ratio of the time for computing a unit load to the time for communicating a unit load. We also prove that, by choosing interior initial processors, an asymptotic speed-up of $2^k\beta^{1-1/2^k}$ can be achieved [19].

The significance of our research is three-fold. First, these results include the earlier results for linear arrays and two-dimensional meshes in [13] as special cases. Second, this paper provides a unified treatment of divisible load distribution on k-dimensional meshes and tori for all k ≥ 1. Third, divisible load distribution on k-dimensional meshes and tori with k > 3 had never been addressed before; our work gives an initial investigation of these networks.

2. THE MODEL

We consider parallel processing of a divisible load on a multicomputer system with N processors $P_1, P_2, \ldots, P_N$ connected by a static interconnection network. Each processor $P_i$ has $n_i$ neighbors. It is assumed that $P_i$ has $n_i$ separate ports for communication with each of the $n_i$ neighbors; i.e. processor $P_i$ can send messages to all its $n_i$ neighbors simultaneously. Once a processor sends a load fraction to a neighbor, it can proceed with other computation and communication activities. This provides the capability to overlap computation with communication and enhances the system performance. However, a neighbor (receiver) must wait until a load fraction arrives before it starts to process the load fraction. It is this waiting time that limits the overall system performance.

Let $T_{cm}$ be the time to transmit a unit load along a link. The time to send a load to a neighbor is proportional to the size of the load, with a negligible communication startup time. Let $T_{cp}$ be the time to process a unit load on a processor. Again, the computation time is proportional to the size of a load. We use $\beta = T_{cp}/T_{cm}$ to denote the computation granularity, a parameter indicating the nature of a parallel computation and a parallel architecture. A large (small) $\beta$ gives a small (large) communication overhead. A computation-intensive load has a large $\beta$ and a communication-intensive load has a small $\beta$. An infinite $\beta$ implies that the communication cost is negligible.

3. THE LOAD DISTRIBUTION ALGORITHM

A k-dimensional mesh $M_k$ can be specified by k ≥ 1 positive integers $N_1, N_2, \ldots, N_k$, where $N_r \ge 2$ is the size in the rth dimension of the mesh, $1 \le r \le k$. $M_k$ has a set of $N = N_1 N_2 \cdots N_k$ processors, $\mathcal{P}_k = \{P_{j_1,j_2,\ldots,j_k} \mid 1 \le j_r \le N_r,\ 1 \le r \le k\}$. Each processor $P_{j_1,j_2,\ldots,j_k}$ has neighbors $P_{j_1,\ldots,j_r \pm 1,\ldots,j_k}$, if they exist. Assume that there is a load x initially on processor $P_{N_1,N_2,\ldots,N_k}$, called the initial processor. The load is to be distributed over all the N processors of the k-dimensional mesh for parallel processing.

We now describe our algorithm $A_{N_1,N_2,\ldots,N_k}$ for processing divisible loads on a k-dimensional mesh $M_k$ of size $N = N_1 N_2 \cdots N_k$. For notational convenience, a single processor is treated as a zero-dimensional mesh $M_0$ with a one-processor set $\mathcal{P}_0$. Algorithm $A_{N_1,N_2,\ldots,N_k}$ works as follows.

(A1) When N = 1, the single processor processes load x by itself.

(A2) In general, when N > 1, the initial processor $P_{N_1,N_2,\ldots,N_k}$ sends a fraction $\alpha$ of the load x to processor $P_{N_1,N_2,\ldots,N_k-1}$. The remaining load $(1-\alpha)x$ is processed by the (k−1)-dimensional mesh $M_{k-1}$ of size $N_1 N_2 \cdots N_{k-1}$ with the set of processors $\mathcal{P}_{k-1} = \{P_{j_1,j_2,\ldots,j_{k-1},N_k} \mid 1 \le j_r \le N_r,\ 1 \le r \le k-1\}$, using the load distribution algorithm $A_{N_1,N_2,\ldots,N_{k-1}}$.

(A3) If $N_k > 2$, processor $P_{N_1,N_2,\ldots,N_k-1}$ is regarded as the initial processor of the k-dimensional mesh $M'_k$ of size $N_1 N_2 \cdots (N_k - 1)$ with the set of processors $\mathcal{P}'_k = \{P_{j_1,j_2,\ldots,j_k} \mid 1 \le j_r \le N_r,\ 1 \le r \le k-1,\ 1 \le j_k \le N_k - 1\}$. Upon receipt of the load $\alpha x$ by processor $P_{N_1,N_2,\ldots,N_k-1}$, the mesh $M'_k$ processes the load $\alpha x$ using the load distribution algorithm $A_{N_1,N_2,\ldots,N_k-1}$.

(A4) If $N_k = 2$, processor $P_{N_1,N_2,\ldots,N_k-1}$ is regarded as the initial processor of the (k−1)-dimensional mesh $M'_{k-1}$ of size $N_1 N_2 \cdots N_{k-1}$ with the set of processors $\mathcal{P}'_{k-1} = \{P_{j_1,j_2,\ldots,j_{k-1},N_k-1} \mid 1 \le j_r \le N_r,\ 1 \le r \le k-1\}$. Upon receipt of the load $\alpha x$ by processor $P_{N_1,N_2,\ldots,N_k-1}$, the mesh $M'_{k-1}$ processes the load $\alpha x$ using the load distribution algorithm $A_{N_1,N_2,\ldots,N_{k-1}}$.

A k-dimensional torus is similar to a k-dimensional mesh except that each processor $P_{j_1,j_2,\ldots,j_k}$ has neighbors $P_{j_1,\ldots,(j_r \pm 1) \bmod N_r,\ldots,j_k}$. Since a k-dimensional torus contains a k-dimensional mesh as a subnetwork, algorithm $A_{N_1,N_2,\ldots,N_k}$ is also applicable to load distribution on k-dimensional tori.

4. PARALLEL TIME AND SPEED-UP

Let $T_{N_1,N_2,\ldots,N_k}$ denote the parallel time for processing one unit of load on a k-dimensional mesh $M_k$ of size $N = N_1 N_2 \cdots N_k$ by using the load distribution algorithm $A_{N_1,N_2,\ldots,N_k}$. Since both computation and communication times are linearly proportional to the amount of load, the time for processing x units of load on a k-dimensional mesh of size N is $x\,T_{N_1,N_2,\ldots,N_k}$, for all x ≥ 0. The speed-up $S_{N_1,N_2,\ldots,N_k}$ is defined as the ratio of the sequential processing time to the parallel processing time, namely
$$S_{N_1,N_2,\ldots,N_k} = \frac{T_1}{T_{N_1,N_2,\ldots,N_k}} = \frac{T_{cp}}{T_{N_1,N_2,\ldots,N_k}}.$$
We are particularly interested in
$$T_{\infty,\infty,\ldots,\infty} = \lim_{N_1,N_2,\ldots,N_k \to \infty} T_{N_1,N_2,\ldots,N_k}, \qquad S_{\infty,\infty,\ldots,\infty} = \lim_{N_1,N_2,\ldots,N_k \to \infty} S_{N_1,N_2,\ldots,N_k},$$
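The split rule of algorithm $A_{N_1,\ldots,N_k}$ can be checked numerically. Below is a minimal Python sketch (the helper names are ours, not the paper's), assuming $T_{cp} = T_{cm} = 1$, i.e. $\beta = 1$; at every step it picks the fraction $\alpha$ that makes both parts finish simultaneously, which is the optimality condition used in the proof of Theorem 4.1.

```python
from functools import lru_cache

T_CP = 1.0  # time to compute one unit of load
T_CM = 1.0  # time to transmit one unit of load; beta = T_CP / T_CM = 1

@lru_cache(maxsize=None)
def parallel_time(dims):
    """Parallel time T_{N1,...,Nk} for one unit of load under algorithm A.

    dims = (N1, ..., Nk); the empty tuple is the zero-dimensional mesh M_0,
    i.e. a single processor, covering step (A1).
    """
    if not dims:
        return T_CP
    *head, nk = dims
    a = parallel_time(tuple(head))  # (A2): the (k-1)-dim mesh runs (1-alpha)x
    # (A3)/(A4): the rest of the mesh (last dimension shrunk by one, or
    # dropped when N_k = 2) runs alpha*x after alpha*T_CM of communication.
    b = parallel_time(tuple(head) + ((nk - 1,) if nk > 2 else ()))
    # Choose alpha so that both parts finish at the same time:
    #   (1 - alpha) * a == alpha * (T_CM + b)
    alpha = a / (a + b + T_CM)
    return (1 - alpha) * a

def speedup(dims):
    return T_CP / parallel_time(dims)

print(round(speedup((30,)), 3))    # 1.618: approaches the golden ratio for k = 1
print(round(speedup((15, 15)), 3)) # 2.317: the k = 2 limit for beta = 1
```

For $\beta = 1$ the speed-up saturates quickly; a 30-processor linear array is already indistinguishable from the limit, which is why the limits $T_{\infty,\ldots,\infty}$ and $S_{\infty,\ldots,\infty}$ are the quantities of interest.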

i.e. the ultimate parallel processing time and the asymptotic speed-up when the size of a k-dimensional mesh goes to infinity. We use $T^{(k)}_\infty$ and $S^{(k)}_\infty$ to represent $T_{\infty,\infty,\ldots,\infty}$ and $S_{\infty,\infty,\ldots,\infty}$, respectively, where there are k $\infty$'s.

We now prove the main result of the paper.

THEOREM 4.1. For k-dimensional meshes, we have
$$T^{(1)}_\infty = \tfrac{1}{2}\left(\sqrt{4T_{cp}T_{cm} + T_{cm}^2} - T_{cm}\right),$$
$$T^{(k)}_\infty = \tfrac{1}{2}\left(\sqrt{4T^{(k-1)}_\infty T_{cm} + T_{cm}^2} - T_{cm}\right), \quad k \ge 2.$$
Furthermore, we have $S^{(k)}_\infty \approx \beta^{1-1/2^k}$ for large $\beta$ and all k ≥ 1.

Proof. The value $T_{N_1,N_2,\ldots,N_k}$ can be obtained recursively as follows. First, by (A1), we have
$$T_1 = T_{cp}. \qquad (1)$$
In general, when N > 1, processor $P_{N_1,N_2,\ldots,N_k}$ can proceed without waiting after sending the load fraction $\alpha$ in (A2). Hence, the time spent by $M_{k-1}$ is $(1-\alpha)T_{N_1,\ldots,N_{k-1}}$. In (A3), it takes $\alpha T_{cm}$ time for the load fraction $\alpha$ to reach processor $P_{N_1,\ldots,N_k-1}$. Then $M'_k$ requires $\alpha T_{N_1,\ldots,N_{k-1},N_k-1}$ time to process the load fraction $\alpha$. To minimize the parallel processing time $T_{N_1,\ldots,N_k}$, we need to make sure that both $M_{k-1}$ and $M'_k$ spend the same amount of time, i.e.
$$T_{N_1,\ldots,N_k} = (1-\alpha)T_{N_1,\ldots,N_{k-1}} = \alpha\left(T_{cm} + T_{N_1,\ldots,N_{k-1},N_k-1}\right).$$
This implies that
$$\alpha = \frac{T_{N_1,\ldots,N_{k-1}}}{T_{N_1,\ldots,N_{k-1}} + T_{N_1,\ldots,N_{k-1},N_k-1} + T_{cm}}.$$
Hence, $T_{N_1,\ldots,N_k}$ satisfies the following recurrence relation (with the convention $T_{N_1,\ldots,N_{k-1},1} = T_{N_1,\ldots,N_{k-1}}$, which covers case (A4)):
$$T_{N_1,\ldots,N_k} = \frac{T_{N_1,\ldots,N_{k-1}}\left(T_{N_1,\ldots,N_{k-1},N_k-1} + T_{cm}\right)}{T_{N_1,\ldots,N_{k-1}} + T_{N_1,\ldots,N_{k-1},N_k-1} + T_{cm}}, \quad N > 1. \qquad (2)$$
Taking the limit on both sides of Equation (2), we obtain
$$T^{(1)}_\infty = \frac{T_{cp}\left(T^{(1)}_\infty + T_{cm}\right)}{T_{cp} + T^{(1)}_\infty + T_{cm}}, \quad k = 1,$$
$$T^{(k)}_\infty = \frac{T^{(k-1)}_\infty\left(T^{(k)}_\infty + T_{cm}\right)}{T^{(k-1)}_\infty + T^{(k)}_\infty + T_{cm}}, \quad k \ge 2.$$
That is,
$$\left(T^{(1)}_\infty\right)^2 + T_{cm}T^{(1)}_\infty - T_{cp}T_{cm} = 0, \quad k = 1,$$
$$\left(T^{(k)}_\infty\right)^2 + T_{cm}T^{(k)}_\infty - T^{(k-1)}_\infty T_{cm} = 0, \quad k \ge 2.$$
Solving these quadratic equations, we get $T^{(1)}_\infty$ and $T^{(k)}_\infty$ with k ≥ 2 as in the theorem.

As for asymptotic speed-up, we prove by induction on k ≥ 1 that $S^{(k)}_\infty \approx \beta^{1-1/2^k}$ for large $\beta$. When k = 1, we note that
$$S^{(1)}_\infty = \frac{T_{cp}}{T^{(1)}_\infty} = \frac{2T_{cp}}{\sqrt{4T_{cp}T_{cm}+T_{cm}^2}-T_{cm}} = \frac{2\beta}{\sqrt{4\beta+1}-1} \approx \beta^{1/2}$$
for large $\beta$. When k ≥ 2, we have
$$S^{(k)}_\infty = \frac{T_{cp}}{T^{(k)}_\infty} = \frac{2T_{cp}}{\sqrt{4T^{(k-1)}_\infty T_{cm}+T_{cm}^2}-T_{cm}} = \frac{2\beta}{\sqrt{4\left(T^{(k-1)}_\infty/T_{cm}\right)+1}-1}.$$
By the induction hypothesis, i.e.
$$S^{(k-1)}_\infty = \frac{T_{cp}/T_{cm}}{T^{(k-1)}_\infty/T_{cm}} = \frac{\beta}{T^{(k-1)}_\infty/T_{cm}} \approx \beta^{1-1/2^{k-1}},$$
we know that $T^{(k-1)}_\infty/T_{cm} \approx \beta^{1/2^{k-1}}$ for large $\beta$. Therefore,
$$S^{(k)}_\infty \approx \frac{2\beta}{\sqrt{4\beta^{1/2^{k-1}}+1}-1} \approx \beta^{1-1/2^k} \qquad (3)$$
for large $\beta$. This proves the theorem. □

The following corollaries are immediate consequences of Theorem 4.1 for two- and three-dimensional meshes.

COROLLARY 4.1. For two-dimensional meshes, we have
$$T_{\infty,\infty} = \tfrac{1}{2}\left(\sqrt{4T_\infty T_{cm} + T_{cm}^2} - T_{cm}\right), \quad \text{where } T_\infty = \tfrac{1}{2}\left(\sqrt{4T_{cp}T_{cm} + T_{cm}^2} - T_{cm}\right).$$
Furthermore,
$$S_{\infty,\infty} = \frac{T_{cp}}{T_{\infty,\infty}} = \frac{2\beta}{\sqrt{2\sqrt{4\beta+1}-1}-1} \approx \beta^{3/4}$$
for large $\beta$.

COROLLARY 4.2. For three-dimensional meshes, we have
$$T_{\infty,\infty,\infty} = \tfrac{1}{2}\left(\sqrt{4T_{\infty,\infty}T_{cm} + T_{cm}^2} - T_{cm}\right).$$
Furthermore,
$$S_{\infty,\infty,\infty} = \frac{T_{cp}}{T_{\infty,\infty,\infty}} = \frac{2\beta}{\sqrt{2\sqrt{2\sqrt{4\beta+1}-1}-1}-1} \approx \beta^{7/8}$$
for large $\beta$.

It is clear that Theorem 4.1 also holds for k-dimensional tori.

COROLLARY 4.3. For k-dimensional tori, we have
$$T^{(1)}_\infty = \tfrac{1}{2}\left(\sqrt{4T_{cp}T_{cm} + T_{cm}^2} - T_{cm}\right), \qquad T^{(k)}_\infty = \tfrac{1}{2}\left(\sqrt{4T^{(k-1)}_\infty T_{cm} + T_{cm}^2} - T_{cm}\right), \quad k \ge 2.$$
Furthermore, we have $S^{(k)}_\infty \approx \beta^{1-1/2^k}$ for large $\beta$ and all k ≥ 1.

In Table 1, we demonstrate numerical values of the asymptotic speed-up $S^{(k)}_\infty$. For each pair of $\beta$ and k, we give three values of $S^{(k)}_\infty$. The first is the exact value calculated by using Theorem 4.1; the second and third are the estimations of $S^{(k)}_\infty$ in Equation (3). These estimations are more accurate for small k and large $\beta$ than for large k and small $\beta$.

5. PERFORMANCE IMPROVEMENT

Improved speed-up can be achieved by placing the initial load on an interior processor instead of corner and boundary processors. A submesh $M'_k$ of a k-dimensional mesh $M_k$ of size $N = N_1 N_2 \cdots N_k$ contains processors $\mathcal{P}'_k = \{P_{j_1,j_2,\ldots,j_k} \mid a_r \le j_r \le b_r,\ 1 \le r \le k\}$, where $a_r < b_r$ for all 1 ≤ r ≤ k and $b_r - a_r + 1$ is the size in the rth dimension of the submesh. A processor $P_{j_1,j_2,\ldots,j_k}$ is called a boundary processor of $M'_k$ in dimension r if $j_r = a_r$ or $j_r = b_r$, and a corner processor of $M'_k$ if $j_r = a_r$ or $j_r = b_r$ for all 1 ≤ r ≤ k. A processor $P_{j_1,j_2,\ldots,j_k}$ is called an interior processor of $M'_k$ in dimension r if $a_r < j_r < b_r$.

We say that a k-dimensional mesh $M_k$ is split at $s_r$ in dimension r if $M_k$ is divided into two disjoint submeshes: $M'_k$ containing processors $\mathcal{P}'_k = \{P_{j_1,\ldots,j_k} \mid 1 \le j_{r'} \le N_{r'},\ r' \ne r,\ 1 \le j_r \le s_r\}$ and $M''_k$ containing processors $\mathcal{P}''_k = \{P_{j_1,\ldots,j_k} \mid 1 \le j_{r'} \le N_{r'},\ r' \ne r,\ s_r + 1 \le j_r \le N_r\}$. If a processor $P_{j_1,\ldots,j_k}$ is an interior processor of a k-dimensional mesh $M_k$ in dimensions $r_1, r_2, \ldots, r_m$, it will eventually become a corner processor of a submesh by splitting $M_k$ m times.

Let $T^{(m)}_{N_1,\ldots,N_k}$ denote the parallel time for processing one unit of load on a k-dimensional mesh $M_k$ of size $N = N_1 N_2 \cdots N_k$ when the initial processor is an interior processor in m of the k dimensions, where 0 ≤ m ≤ k. ($T^{(0)}_{N_1,\ldots,N_k}$ is simply $T_{N_1,\ldots,N_k}$.) Define
$$T^{(m)}_{\infty,\ldots,\infty} = \lim_{N_1,\ldots,N_k \to \infty} T^{(m)}_{N_1,\ldots,N_k}, \qquad S^{(m)}_{\infty,\ldots,\infty} = \lim_{N_1,\ldots,N_k \to \infty} S^{(m)}_{N_1,\ldots,N_k},$$
with abbreviations $T^{(m)}_\infty$ and $S^{(m)}_\infty$.

THEOREM 5.1. For k-dimensional meshes and tori, we have
$$T^{(m)}_\infty \approx \frac{T_\infty}{2^m} \quad \text{and} \quad S^{(m)}_\infty \approx 2^m \beta^{1-1/2^k}$$
for large $\beta$ and all 0 ≤ m ≤ k. When the initial processor is an interior processor in all the k dimensions, an asymptotic speed-up of $2^k\beta^{1-1/2^k}$ can be achieved.

Proof. When m = 1, the initial processor $P_{N_1,\ldots,j_r,\ldots,N_k}$ is an interior processor in one dimension r. The initial processor sends a fraction $\alpha$ of the load x to one of its neighbors in dimension r, say $P_{N_1,\ldots,j_r+1,\ldots,N_k}$. The initial processor and the selected neighbor are initial processors of two separate k-dimensional meshes $M'_k$ and $M''_k$ obtained by splitting $M_k$ at $j_r$ in the rth dimension. These two submeshes process the loads $\alpha x$ and $(1-\alpha)x$, respectively, by using algorithm $A_{N_1,\ldots,N_k}$. It is clear that
$$T^{(1)}_\infty = (1-\alpha)T_\infty = \alpha(T_\infty + T_{cm}),$$
which yields
$$T^{(1)}_\infty = \frac{T_\infty(T_\infty + T_{cm})}{2T_\infty + T_{cm}} = T_\infty\left(\frac{\beta^{1/2^k}+1}{2\beta^{1/2^k}+1}\right) \approx \frac{T_\infty}{2}$$
and $S^{(1)}_\infty = T_{cp}/T^{(1)}_\infty \approx 2\beta^{1-1/2^k}$ for large $\beta$.

When m > 1, the initial processor is an interior processor in dimensions $r_1, r_2, \ldots, r_m$. The k-dimensional mesh $M_k$ is first split at $j_{r_1}$ in dimension $r_1$. The two resulting submeshes are further split in dimension $r_2$, the four resulting submeshes are further split in dimension $r_3$, and so on. It is not difficult to see that
$$T^{(m)}_\infty = (1-\alpha)T^{(m-1)}_\infty = \alpha\left(T^{(m-1)}_\infty + T_{cm}\right),$$
so that
$$T^{(m)}_\infty = \frac{T^{(m-1)}_\infty\left(T^{(m-1)}_\infty + T_{cm}\right)}{2T^{(m-1)}_\infty + T_{cm}} \approx \frac{T^{(m-1)}_\infty}{2}$$
for all 1 ≤ m ≤ k. Hence, $T^{(m)}_\infty \approx T_\infty/2^m$ and $S^{(m)}_\infty = T_{cp}/T^{(m)}_\infty \approx 2^m\beta^{1-1/2^k}$. □

In Table 2, we demonstrate numerical values of the asymptotic speed-up $S^{(m)}_\infty$ for k = 3. It can be seen that the doubling effect $S^{(m)}_\infty \approx 2S^{(m-1)}_\infty$ is stronger for small m and large $\beta$ than for large m and small $\beta$.
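Both theorems are easy to evaluate numerically. The following Python sketch (the helper names are ours; it fixes $T_{cm} = 1$ so that $T_{cp} = \beta$) iterates the quadratic of Theorem 4.1 and then the halving recurrence from the proof of Theorem 5.1.

```python
from math import sqrt

def t_inf(beta, k):
    """Ultimate parallel time T_inf^(k) of Theorem 4.1, in units of T_cm."""
    t = beta  # k = 0: a single processor needs T_cp = beta per unit load
    for _ in range(k):
        # positive root of T^2 + T_cm*T - t*T_cm = 0, with T_cm = 1
        t = (sqrt(4 * t + 1) - 1) / 2
    return t

def s_inf(beta, k, m=0):
    """Asymptotic speed-up for an initial processor that is interior in m of
    the k dimensions (m = 0 is the corner case of Theorem 4.1)."""
    t = t_inf(beta, k)
    for _ in range(m):
        # limit recurrence from the proof of Theorem 5.1; for large beta
        # each application roughly halves the parallel time
        t = t * (t + 1) / (2 * t + 1)
    return beta / t

def s_approx(beta, k):
    """Large-beta approximation beta^(1 - 1/2^k) of Equation (3)."""
    return beta ** (1 - 1 / 2 ** k)

print(s_inf(100, 3))       # exact asymptotic speed-up, about 83.65
print(s_approx(100, 3))    # 100^(7/8), about 56.23
print(s_inf(100, 3, m=3))  # about 250.55; reaches 2^3 times the corner
                           # value only as beta grows very large
```

The gap between the last two printed values illustrates the remark accompanying Table 2: at moderate $\beta$, each additional interior dimension multiplies the speed-up by a factor well below the asymptotic 2.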

TABLE 1. Numerical values of asymptotic speed-up $S^{(k)}_\infty$. For each pair of $\beta$ and k, the first value is the exact value from Theorem 4.1; the second and third values are the two estimations in Equation (3).

  β          k=1        k=2        k=3        k=4        k=5        k=6        k=7
  1        1.618      2.317      3.071      3.865      4.690      5.537      6.401
           1.618      1.618      1.618      1.618      1.618      1.618      1.618
           1.000      1.000      1.000      1.000      1.000      1.000      1.000
  2        2.000      3.236      4.633      6.142      7.731      9.379     11.073
           2.000      2.532      2.858      3.040      3.136      3.186      3.211
           1.414      1.682      1.834      1.915      1.957      1.978      1.989
  5        2.791      5.384      8.537     12.073     15.875     19.870     24.008
           2.791      4.644      6.089      7.007      7.526      7.802      7.945
           2.236      3.344      4.089      4.522      4.755      4.876      4.938
  10       3.702      8.210     14.053     20.806     28.188     36.015     44.169
           3.702      7.423     10.820     13.186     14.594     15.363     15.765
           3.162      5.623      7.499      8.660      9.306      9.647      9.822
  20       5.000     12.808     23.642     36.572     50.933     66.297     82.391
           5.000     11.954     19.272     24.831     28.304     30.253     31.286
           4.472      9.457     13.753     16.585     18.213     19.085     19.537
  50       7.589     23.640     48.175     78.759    113.465    151.029    190.640
           7.589     22.668     41.472     57.380     67.961     74.102     77.414
           7.071     18.803     30.662     39.155     44.246     47.035     48.495
  100     10.512     38.103     83.652    142.397    210.155    284.122    362.500
          10.512     37.016     74.226    108.204    131.860    145.935    153.630
          10.000     31.623     56.234     74.989     86.596     93.057     96.466
  200     14.651     61.950    146.515    259.456    391.874    537.647    692.847
          14.651     60.722    133.098    204.164    255.877    287.416    304.883
          14.142     53.183    103.134    143.620    169.482    184.110    191.890
  500     22.866    118.968    310.527    578.785    900.245   1258.041   1641.289
          22.866    117.507    288.800    472.992    614.813    704.124    754.439
          22.361    105.737    229.932    339.066    411.744    453.731    476.304
  1000    32.127    196.021    551.472   1067.886   1697.120   2403.287   3163.080
          32.127    194.341    519.882    893.603   1193.486   1386.886   1497.246
          31.623    177.828    421.697    649.382    805.842    897.687    947.464
  2000    45.224    324.207    983.500   1977.959   3210.240   4604.602   6111.473
          45.224    322.265    937.253   1689.128   2317.166   2731.806   2971.438
          44.721    299.070    773.395   1243.700   1577.149   1776.035   1884.693
  5000    71.212    633.377   2124.222   4489.820   7487.879  10917.256  14644.645
          71.212    631.009   2046.989   3922.265   5571.297   6693.664   7353.204
          70.711    594.604   1724.244   2936.192   3831.574   4376.970   4678.125
  10000  100.501   1054.012   3816.058   8373.410  14249.637  21026.595  28424.059
         100.501   1051.249   3701.562   7422.609  10820.441  13185.993  14593.534
         100.000   1000.000   3162.278   5623.413   7498.942   8659.643   9305.720

6. NOTES ON RELATED WORK

Performance limits to parallel processing of divisible loads on static interconnection networks have been observed previously. Asymptotic performance analyses for linear arrays were conducted in [20]. The special cases of Theorems 4.1 and 5.1 with k = 1 for linear arrays are essentially similar to results in [21], and the special cases with k = 2 for two-dimensional meshes were obtained in [12]. An infinite two-dimensional mesh with the initial processor in the center was considered in [8]; however, our study deals with finite meshes. It was shown in [10] that a speed-up of O(β) can be achieved in three-dimensional meshes. The result is

obtained by adopting the circuit-switched routing technique, which assumes that communication times are independent of the distances among processors.

TABLE 2. Numerical values of asymptotic speed-up $S^{(m)}_\infty$ (k = 3).

  β          m=0        m=1        m=2        m=3
  1        3.071      3.825      4.618      5.440
  2        4.633      6.030      7.532      9.112
  5        8.537     11.690     15.192     18.954
  10      14.053     19.895     26.550     33.814
  20      23.642     34.477     47.134     61.176
  50      48.175     72.710    102.337    135.926
  100     83.652    129.201    185.571    250.553
  200    146.515    231.080    338.290    463.980
  500    310.527    502.086    752.607   1053.023
  1000   551.472    906.922   1382.517   1962.793
  2000   983.500   1642.793   2544.734   3664.595
  5000  2124.222   3615.067   5713.174   8379.599
  10000 3816.058   6578.103  10546.049  15678.934

Note that part (A4) of algorithm $A_{N_1,N_2,\ldots,N_k}$ is not related to the analysis in this paper. The reason is that, in this paper, we increase the network size N by fixing k and increasing the sizes of all the dimensions. It is also possible to increase N by fixing $N_1, N_2, \ldots, N_k$ and increasing the number of dimensions k. For instance, when all the $N_r$'s are fixed at 2 and N increases as k increases, we get hypercubes. For hypercubes, we use the (k−1)-dimensional mesh $M_{k-1}$ in (A2) and the (k−1)-dimensional mesh $M'_{k-1}$ in (A4) to process the load fractions $(1-\alpha)x$ and $\alpha x$, respectively. Therefore, the analysis of parallel time and speed-up for processing divisible loads on hypercubes follows a different direction [13].

7. CONCLUDING REMARKS

We have proposed a divisible load distribution algorithm on k-dimensional meshes and tori and analyzed the parallel time and speed-up of the algorithm. We have shown that, by using our algorithm on k-dimensional meshes and tori, as the network size becomes large, the asymptotic speed-up of processing divisible loads with corner initial processors is approximately $\beta^{1-1/2^k}$. We have also proved that, by choosing interior initial processors, an asymptotic speed-up of $2^k\beta^{1-1/2^k}$ can be achieved. The improved speed-up for large k is due to the increased network connectivity, which yields a faster speed of load distribution. Our work includes earlier results on linear arrays and two-dimensional meshes as special cases, provides a unified treatment of divisible load distribution on k-dimensional meshes and tori for all k ≥ 1, and gives an initial investigation of divisible load distribution on k-dimensional meshes and tori with k > 3.

ACKNOWLEDGEMENTS

The author wishes to express his gratitude to two anonymous reviewers for their criticism and comments. This material is based upon work supported by the US National Science Foundation under Grant No. CCR-0091719. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] Bharadwaj, V., Ghose, D., Mani, V. and Robertazzi, T. G. (1996) Scheduling Divisible Loads in Parallel and Distributed Systems. IEEE Computer Society Press, Los Alamitos, CA.
[2] Cheng, Y. C. and Robertazzi, T. G. (1988) Distributed computation with communication delays. IEEE Trans. Aerospace Electron. Syst., 24, 700–712.
[3] Bataineh, S., Hsiung, T.-Y. and Robertazzi, T. G. (1994) Closed form solutions for bus and tree networks of processors load sharing a divisible job. IEEE Trans. Computers, 43, 1184–1196.
[4] Sohn, J. and Robertazzi, T. G. (1996) Optimal divisible job load sharing for bus networks. IEEE Trans. Aerospace Electron. Syst., 32, 34–40.
[5] Mani, V. and Ghose, D. (1994) Distributed computation in linear networks: closed-form solutions. IEEE Trans. Aerospace Electron. Syst., 30, 471–483.
[6] Barlas, G. D. (1998) Collection-aware optimum sequencing of operations and closed-form solutions for the distribution of a divisible load on arbitrary processor trees. IEEE Trans. Parallel Distributed Syst., 9, 429–441.
[7] Cheng, Y. C. and Robertazzi, T. G. (1990) Distributed computation for a tree network with communication delays. IEEE Trans. Aerospace Electron. Syst., 26, 511–516.
[8] Błażewicz, J. and Drozdowski, M. (1996) The performance limits of a two-dimensional network of load sharing processors. Found. Comput. Decision Sci., 21, 3–15.
[9] Błażewicz, J., Drozdowski, M., Guinand, F. and Trystram, D. (1999) Scheduling a divisible task in a two-dimensional toroidal mesh. Discrete Appl. Math., 94, 35–50.
[10] Drozdowski, M. and Głazek, W. (1999) Scheduling divisible loads in a three-dimensional mesh of processors. Parallel Computing, 25, 381–404.
[11] Błażewicz, J. and Drozdowski, M. (1995) Scheduling divisible jobs on hypercubes. Parallel Computing, 21, 1945–1956.
[12] Li, K. (1998) Managing divisible load on partitionable networks. In Schaeffer, J. (ed.), High Performance Computing Systems and Applications, pp. 217–228. Kluwer Academic Publishers, Boston, MA.
[13] Li, K. (2003) Parallel processing of divisible loads on partitionable static interconnection networks. Cluster Computing (Special Issue on Divisible Load Scheduling), 6, 47–55.
[14] Błażewicz, J. and Drozdowski, M. (1997) Distributed processing of divisible jobs with communication startup costs. Discrete Appl. Math., 76, 21–41.
[15] Błażewicz, J., Drozdowski, M. and Markiewicz, M. (1999) Divisible task scheduling: concept and verification. Parallel Computing, 25, 87–98.
[16] Ko, K. (2000) Scheduling Data Intensive Parallel Processing in Distributed and Networked Environments. PhD dissertation, Department of Electrical and Computer Engineering, State University of New York, Stony Brook, New York.
[17] Sohn, J., Robertazzi, T. G. and Luryi, S. (1998) Optimizing computing costs using divisible load analysis. IEEE Trans. Parallel Distrib. Syst., 9, 225–234.
[18] Amdahl, G. M. (1967) Validity of the single processor approach to achieving large scale computing capabilities. In Proc. AFIPS Spring Joint Computer Conf., Vol. 30, pp. 483–485.
[19] Li, K. (2002) Speedup of parallel processing of divisible loads on k-dimensional meshes and tori. In Proc. Int. Conf. on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, June 24–27, pp. 171–177. CSREA Press.
[20] Ghose, D. and Mani, V. (1994) Distributed computation with communication delays: asymptotic performance analysis. J. Parallel Distrib. Comput., 23, 293–305.
[21] Bataineh, S. and Robertazzi, T. G. (1992) Ultimate performance limits for networks of load sharing processors. In Proc. Conf. Information Sciences and Systems, pp. 794–799. Princeton University Press, Princeton, NJ.