PPDD: scheduling multi-site divisible loads in single-level tree networks


Xiaolin Li · Bharadwaj Veeravalli

Received: 2 July 2007 / Accepted: 1 September 2009 / Published online: 1 October 2009
© Springer Science+Business Media, LLC 2009

Abstract This paper investigates scheduling strategies for divisible jobs/loads originating from multiple sites in hierarchical networks with heterogeneous processors and communication channels. In contrast, most previous work in the divisible load scheduling theory (DLT) literature mainly addressed scheduling problems with loads originating from a single processor. This is one of the first works to address scheduling multiple loads from multiple sites in the DLT paradigm. In addition, scheduling multi-site jobs is common in Grids and other general distributed systems for resource sharing and coordination. An efficient static scheduling algorithm, PPDD (Processor-set Partitioning and Data Distribution Algorithm), is proposed to near-optimally distribute multiple loads among all processors so that the overall processing time of all jobs is minimized. The PPDD algorithm is applied to two cases: when processors are equipped with front-ends and when they are not. The application of the algorithm to homogeneous systems is also studied. Further, several important properties exhibited by the PPDD algorithm are proven through lemmas. To implement the PPDD algorithm, we propose a communication strategy. In addition, we compare the performance of the PPDD algorithm with a Round-robin Scheduling Algorithm (RSA), which is most commonly used. Extensive case studies through numerical analysis have been conducted to verify the theoretical findings.

Keywords Divisible load theory · Heterogeneous computing · Load scheduling · Grid computing · Single-level tree networks

X. Li (corresponding author), Computer Science Department, Oklahoma State University, 219 MSCS, Stillwater, OK 74078, USA. E-mail: xiaolin@cs.okstate.edu

B. Veeravalli, Department of Electrical and Computer Engineering, The National University of Singapore, 4 Engineering Drive 3, Singapore, Republic of Singapore. E-mail: elebv@nus.edu.sg

1 Introduction

Parallel and distributed heterogeneous computing has become an efficient solution methodology for various real-world applications in science, engineering, and business [1-4]. One of the key issues is how to partition and schedule the jobs/loads that arrive at processing nodes among the available system resources so that the best performance is achieved with respect to the finish time of all input tasks. To utilize the computing resources efficiently, researchers have contributed a large number of load/task scheduling and balancing strategies in the literature [1, 5-7]. Recent efforts have focused on resource sharing and coordination across multi-site resources (e.g., multiple supercomputer centers or virtual organizations).

For divisible load scheduling problems, research since 1988 has established that the optimal workload allocation and scheduling to processors and links can be solved through the use of a very tractable linear model formulation, referred to as Divisible Load Theory (DLT) [6]. DLT features easy computation, a schematic language, equivalent network element modeling, results for infinite-sized networks, and numerous applications. This theoretical formulation opens up attractive modeling possibilities for systems incorporating communication and computation issues, as in parallel, distributed, and Grid environments. Here, the optimality, involving solution time and speedup, is derived in the context of a specific scheduling policy and interconnection topology. The formulation usually generates optimal solutions via a set of linear recursive equations. In simpler models, recursive algebra also produces optimal solutions. The model takes into account the heterogeneity of processor and link speeds as well as relative computation and communication intensity. DLT can model a wide variety of approaches with respect to load distribution (sequential or concurrent), communications (store-and-forward and virtual cut-through switching), and hardware availability (presence or absence of front-end processors). Front-end processors allow a processor to communicate and compute simultaneously by assuming the communication duties. A recent survey of DLT research can be found in [8]. The DLT paradigm has proven to be remarkably flexible in handling a wide range of applications.

1.1 Related work

Since the early days of DLT research, the field has spanned from addressing general optimal scheduling problems on different network topologies to various scenarios with practical constraints, such as time-varying channels [9], minimizing cost factors [10], resource management in Grid environments [11, 12], and distributed image processing [13]. Thorough surveys of DLT can be found in [5, 6, 14, 15]. Load partitioning of intensive computations of large matrix-vector products in a multicast bus network was theoretically investigated in [16]. Research efforts after 1996 particularly started focusing on practical issues such as scheduling multiple divisible loads [17], scheduling divisible loads with arbitrary processor release times in linear networks [18], consideration of communication startup time [19, 20], and buffer constraints [21]. Some of the proposed algorithms were tested through experiments on real-life application problems such as image processing [13], matrix-vector product computations [22], and database operations [23]. Various experimental works have been done using the divisible load paradigm, such as [22] for matrix-vector computation on PC clusters and [23] for other applications on a network of workstations (NOWs). Recent work in DLT also attempted to use adaptive techniques when computation needs to be performed under unknown speeds of the nodes and the links [24]; this study used bus networks as the underlying topology. Beaumont et al. consolidate the results for single-level tree and bus topologies and present extensive discussions on some open problems in this domain [25]. A few new applications and solutions in DLT have been investigated in recent years, e.g., bioinformatics [26], multimedia streaming [27], sensor networks [28, 29], and economic and game-theoretic approaches [30, 31].

Although most of the contributions in the DLT literature consider only a single load originating at one processor [14, 15], scheduling multiple loads has been considered in [32] and [17]. Work presented in [33] considers processing divisible loads originating from an arbitrary site on an arbitrary graph. However, these works considered merely a single-site multi-load scheduling problem and did not address multiple loads originating at arbitrary multiple sites/nodes in networks. Multi-site multi-load scheduling is a practical situation, e.g., multiple jobs submitted to multiple sites in Grids. The point of load origination does impose a significant influence on the performance.
In addition, when one considers multiple loads originating from several nodes/sites, it becomes much more challenging to design efficient scheduling strategies. One paper relevant to the context of the problem addressed in our work is [34]. This study investigated load scheduling and migration problems without synchronization delays in a bus network by assuming that all processors have front-ends and that the communication channel can be dynamically partitioned. Front-ends are communication co-processors that handle communication without involving the processors, so that communication and computation can be fully overlapped and concurrent [6]. In this case, load distribution without any consideration of synchronization delay is quite straightforward, as will be shown later. However, in practice, it would be unreasonable to assume that the channel can be dynamically partitioned. In addition, we shall also consider the case when processors are not equipped with front-ends. Especially in distributed sensor systems, front-end modules may be absent from the processing elements [6]. Recently, [35] investigated the case of two load origination sources in a linear daisy-chain architecture. Divisible load scheduling problems with multiple load sources in Grid environments have been studied in [11, 36].

In this paper, we consider a general load scheduling and balancing problem with multiple loads originating from multiple processors in a single-level tree network. This scenario occurs commonly in realistic situations, such as applications in distributed real-time systems, collaborative grid systems (where each virtual organization can be abstracted as a resource site or a local hierarchical network), and general load balancing and sharing applications [1, 2, 37]. In Grid environments, our proposed model can be applied to the following scenario: we have a super-scheduler across multiple sites and local schedulers for each site; multiple jobs are submitted to the local schedulers and possibly partitioned and migrated across multiple sites by the super-scheduler for resource sharing, load balancing, and high performance/throughput.

1.2 Our contributions

The contributions in this paper are as follows. The primary motivation of this work stems from the fact that, in a real-world scenario, there could be multiple loads submitted for processing on networks originating from several geographically distributed sites, such as in Grid computing environments [1]. While multiple-loads processing has been studied in the DLT literature [17, 32], these studies focus on bus networks and assume that all the loads are available at the root (bus-controller unit) a priori. We regard these as single-site multiple-jobs problems. The study in this paper is different in formulation and attempts to provide a generalized framework. We formulate the load scheduling problem with multiple loads originating from multiple sites in single-level tree networks.¹ For the cases with and without front-ends, we design a scheduling strategy, referred to as the Processor-set Partitioning and Data Distribution Algorithm (PPDD), to achieve a near-optimal processing time for all loads. Several significant properties of the PPDD algorithm are proven in lemmas. A detailed analysis of the time performance of the PPDD algorithm is conducted. In order to actually implement the load distribution obtained through the PPDD algorithm, we propose a load communication strategy. In addition, we compare the time performance of the PPDD algorithm with another algorithm, referred to as the Round-robin Scheduling Algorithm (RSA). It is demonstrated that the proposed PPDD algorithm produces better scheduling solutions than RSA. We verify all these findings via detailed numerical examples on heterogeneous systems of processors. The contributions in this paper are expected to spur further research in this direction, especially when considering scheduling loads on arbitrary networks from multiple sites.

This paper is organized as follows. We first formulate the problem and present some notations in Sect. 2. In Sect. 3, we consider load partitioning strategies for the cases with and without front-ends. Then, we present the communication strategies for these cases in Sect. 4. We prove some important results to analyze the performance of the algorithms. In Sect. 5, we discuss in detail and compare the time performance of the PPDD algorithm and RSA. Section 6 concludes the paper and presents some possible extensions to this work.

¹ It may be noted that our formulation holds for a bus network topology, which is a special case of a single-level tree network.

2 Problem formulation and some notations

This section first introduces the network architecture and then presents the definitions, notations, and terminology to be used throughout the paper.

Fig. 1 A single-level tree network with multiple loads

As shown in Fig. 1, we consider a single-level tree network with a root processor p_0, also referred to as the scheduler for the system, and m processors denoted p_1, ..., p_m, connected via links l_1, ..., l_m, respectively. We assume that the scheduler is only in charge of collecting the load status of each processor and routing the loads from one processor to another, and that it does not participate in processing any load. In other words, p_0 works like a router. Initially, each processor is assumed to have a load to be processed. The objective is to minimize the overall processing time of all the loads submitted to the system (at the various processors). If we do not schedule each of the loads among the set of processors, then the overall processing time of all the loads is determined by the time when the last processor finishes processing its own load. In order to minimize the overall finish time, we should carefully re-schedule and balance the loads among all processors.
Also, the scheduling strategy must be such that a faster processor processes more load while a slower processor processes less load. Since the processors, the links, and the sizes of the loads originating at the various processors are heterogeneous (non-identical), obtaining an optimal solution becomes a complex problem. In the load balancing literature [38], the basic rationale is to balance the loads in such a way that some load fractions from over-loaded processors are transferred to under-loaded processors, so that all the processors have more-or-less identical processing times for the loads assigned to them. Here too, we follow the same strategy as the basic mechanism for balancing the divisible loads among the processors. We introduce some notations and terminology that will be used throughout the paper as follows.

E_i: The time it takes to compute a unit load at processor p_i, i = 1, ..., m.
C_i: The time it takes to transmit a unit load on link l_i, i = 1, ..., m.
L_i: The amount of load originating at p_i for processing, as shown in Fig. 1.
l_i: The load assigned to p_i according to a scheduling strategy.
η: The load distribution obtained. This is defined as an m-tuple denoting the loads assigned to each p_i, and is given by η = {l_1, l_2, ..., l_m}. Certainly, the sum of the l_i must equal the sum of the original loads, that is, L = \sum_{i=1}^{m} l_i = \sum_{i=1}^{m} L_i.
ΔL_i: The load portion to be transferred from or to a processor p_i, i = 1, 2, ..., m.
T_i(m): The finish time for processing the load at p_i.
T(m): The overall processing time for all the loads processed by the m processors. This is given by T(m) = \max_{i=1,\dots,m} T_i(m).
T*(m): The optimal processing time of all the loads.
S_over: The set of processors which are over-loaded. Processors in this set are the potential senders of excess loads.
S_under: The set of processors which are under-loaded. Processors in this set are the potential receivers of loads transferred from the processors in S_over.

In our formulation, we consider a single-level tree network with m processors and a scheduler p_0. Each processor p_i has its own divisible load of size L_i to process, and the goal is to design an efficient scheduling strategy to minimize the overall processing time of all the loads (on all the processors) by partitioning and distributing the loads among all the m processors. Note that the proposed scheduling strategy also handles the situation in which only a subset of the processors have loads to process. Note that T_i(m), the processing time at p_i, is a function of E_i, C_i, l_i, and ΔL_i. According to the above definitions, for a given load of size L units, its computation time at processor p_i is L E_i and its communication time over link l_i is L C_i. Note that the central scheduler works like a router. On a network path, in general, the time taken by a load to reach its final destination depends on the slowest link on the path, owing to the available bandwidths of the various links comprising the path [39]. Thus, if l_i and l_j are the links connecting the source and destination nodes and if C_i ≤ C_j, then we assume that the communication time taken to reach the destination via links l_i and l_j is simply L C_j. It may be noted that this assumption does not affect the way in which the strategy is designed; in fact, we will show that it eases analytical tractability.

Without loss of generality, we index the p_i in the order L_i E_i ≤ L_{i+1} E_{i+1}, i = 1, ..., m − 1. Thus, without any load partitioning and scheduling, the overall processing time of all the loads is determined by the finish time of the last processor p_m, which is given by T_max(m) = L_m E_m. In the case that L_i = 0 for some i ∈ [1, m], we group these processors into a single equivalent processor; the analysis that follows therefore assumes there is at most one processor which has no load. Since a divisible load is assumed to be computationally intensive, a natural assumption is that the computation time of a given load is much larger than its communication time, that is, E_i > C_j, i, j = 1, 2, ..., m. A more general discussion of computation-intensive applications and their computation-communication ratios can be found in [2, 40].
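To make the notation concrete, the following minimal Python sketch (ours, not the authors' code) indexes a processor set by L_i E_i and evaluates T_max(m) = L_m E_m; the parameter values anticipate Example 1 in Sect. 3.

def index_by_load_time(E, C, L):
    # Re-index processors so that L_i * E_i is non-decreasing (Sect. 2).
    order = sorted(range(len(E)), key=lambda i: L[i] * E[i])
    return ([E[i] for i in order], [C[i] for i in order], [L[i] for i in order])

E = [50, 65, 60, 45, 80]         # E_i: sec/MB to compute a unit load
C = [0.3, 0.2, 0.15, 0.1, 0.55]  # C_i: sec/MB to transmit a unit load
L = [100, 110, 120, 180, 150]    # L_i: MB originating at p_i
E, C, L = index_by_load_time(E, C, L)
T_max = L[-1] * E[-1]            # without sharing, the last processor dominates
print(T_max)                     # 12000 sec for these values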
In addition, in [6], a condition referred to as Rule A for single-level tree networks was used to eliminate all the redundant processor-link pairs and obtain an optimally reduced network that achieves the optimal processing time. In our current formulation, with the assumption that E_i > C_j, i, j = 1, 2, ..., m, Rule A is automatically satisfied. The reader is referred to the implications of Rule A as explained in [6].

In the next section, we shall first identify a condition that designates a processor as over-loaded or under-loaded and form the S_over and S_under sets, respectively. Then, we obtain the exact load fractions ΔL_i, p_i ∈ S_under, to be received by the processors in S_under, and ΔL_j, p_j ∈ S_over, to be extracted from the processors in S_over, so as to minimize the overall processing time. Thus, we obtain the exact load portion assigned to p_i as l_i = L_i + ΔL_i, p_i ∈ S_under, or l_j = L_j − ΔL_j, p_j ∈ S_over, for i, j ∈ {1, ..., m}. From the resulting load distribution η, we obtain the overall processing time of all the loads.

3 Load partitioning strategies

In this section, we consider two cases, namely when all the processors are equipped with front-ends and when they are not. In the case with front-ends, we can improve the processing time performance by efficiently overlapping communication with computation [6]. In the case without front-ends, however, communication and computation cannot be fully overlapped at each processor, and the communication delays incurred while redistributing the loads among the processors need to be minimized.

The strategy involves two phases. In the first phase, the entire set of loads is partitioned. In the second phase, the partitioned loads are transferred from one processor to another following a communication strategy. These two phases are carried out for both the with and without front-end cases. This section focuses on the first phase; the next section investigates the second phase. The implementation of the proposed PPDD algorithm also involves these two phases. In the first phase, the scheduler p_0 collects the load distribution information from all slave processors and applies the

PPDD algorithm to obtain the near-optimal load partitions. In the second phase, the set of over-loaded processors initiates sending data and the set of under-loaded processors initiates receiving data; the scheduler coordinates these slaves' sending and receiving operations by routing the data among them. Note that, although the PPDD algorithm appears iterative in obtaining the near-optimal data partition, the amount of load migrated to or from each processor is adjusted only once.

We assume that each processor initially has its own load to process. A processor can start processing its own load or communicating with other processors from time t = 0 onwards, as per the design of a scheduling strategy. During some time interval, a processor may have no load available to process while also not being engaged in receiving any load from other processors. In this situation, the processor simply remains idle without any activity. However, this processor may be assigned a load portion at some later time by the scheduler; hence, until that time, the processor remains idle. We refer to this idle time interval as a starvation gap in the rest of the paper. Efficient load balancing strategies are thus expected to minimize these starvation gaps and maximize system utilization.

3.1 With front-ends

For the case with front-ends, consider an ideal situation in which there is no starvation gap. Also, we assume that the entire communication can be overlapped by computation. In other words, a processor will not starve for data while receiving it from other processors; it will remain engaged in processing its own load. We refer to this situation as the ideal case hereafter. To achieve the optimal processing time for the entire set of loads, we should balance the loads among all the processors such that all the participating processors finish processing at the same time instant. We use this criterion as the optimality condition to determine the optimal solution, as in [6]. Intuitively, if some processors complete processing earlier and other processors complete processing later, we can reschedule some workload from the late processors to the early processors to reduce the overall processing time (which is determined by the processor that finishes last). Thus, for the ideal situation mentioned above, the optimal processing time is given by

T_{ideal}(m) = \frac{\sum_{i=1}^{m} L_i}{\sum_{i=1}^{m} 1/E_i}   (1)

In the above equation, the numerator is the sum of all the loads and the denominator is the total processing power available in the system. From (1), we can obtain the load portions to be transferred from/to the nodes as

\Delta L_i = \left| L_i - \frac{T_{ideal}(m)}{E_i} \right| = |L_i - l_i|, \quad i = 1, 2, \dots, m   (2)

where l_i = T_{ideal}(m)/E_i, i = 1, 2, ..., m, is the load processed at processor p_i after balancing. Note that when L_i > l_i, processor p_i belongs to the set of over-loaded nodes S_over (senders), and hence some of the load at p_i should be transferred to other nodes. On the other hand, when L_i ≤ l_i, p_i belongs to the set of under-loaded nodes S_under (receivers), to which load from other nodes will be transferred. As mentioned earlier, since we index the processors in the order of minimum L_i E_i first, we can uniquely obtain an integer K such that p_i ∈ S_under, i = 1, 2, ..., K, and p_i ∈ S_over, i = K + 1, K + 2, ..., m. We refer to K as a delimiter that separates the receiver and sender sets. Thus, for all p_i ∈ S_under we have L_i ≤ l_i, and for all p_j ∈ S_over we have L_j > l_j.
The load distribution algorithm is presented in Table 1. We initially use the optimal solution obtained for the ideal case and determine a delimiter K to identify the potential senders and receivers. Then, using (2), we derive the load fractions to be exchanged, ΔL_i, i = 1, ..., m, thus obtaining a load distribution η. Because of the assumption that the entire communication can be overlapped by computation, we immediately obtain the finish time of each processor as T_i(m) = l_i E_i. Since the algorithm first partitions the processors into two sets and then distributes the extra loads from the sender set to the receiver set, we refer to it as the Processor-set Partitioning and Data Distribution Algorithm (PPDD).

Table 1 PPDD algorithm for the case with front-ends

Initial stage: From (1) and (2), we obtain the initial delimiter K which separates the sender and receiver sets.
Load distribution: The load assigned to p_i is l_i = L_i + ΔL_i, i = 1, 2, ..., K, and l_i = L_i − ΔL_i, i = K + 1, K + 2, ..., m.
Overall processing time: The finish time of processor p_i is given by T_i(m) = l_i E_i. Thus, we obtain the overall processing time T(m) = max{T_i(m)}, i = 1, 2, ..., m.
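Under the with-front-ends assumptions, (1), (2), and Table 1 amount to only a few arithmetic steps. The sketch below is our illustrative Python rendering, not the authors' implementation; signed differences L_i − l_i are used internally to mark transfer direction, while the returned ΔL_i are magnitudes as in the text.

def ppdd_with_frontends(E, L):
    m = len(E)
    T_ideal = sum(L) / sum(1.0 / e for e in E)   # equation (1)
    l = [T_ideal / e for e in E]                 # balanced assignments l_i
    diff = [L[i] - l[i] for i in range(m)]       # (2); diff <= 0 marks a receiver
    K = sum(1 for d in diff if d <= 0)           # delimiter: p_1 .. p_K receive
    dL = [abs(d) for d in diff]
    # Table 1: with front-ends, every processor finishes at l_i * E_i = T_ideal.
    return l, dL, K, T_ideal

For the parameters of Example 1 below, this returns K = 3 and the ΔL_i values quoted there.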

Note that in the above strategy for load distribution we have not explicitly discussed how the loads are communicated to the respective processors; rather, we have discussed how much load a processor is assigned from the entire set of loads. We shall discuss the communication strategy in the next section. However, in practical situations there may be starvation gaps, and not all communication can be overlapped with computation. We will see how this issue is addressed in Sect. 4, where we discuss the load communication strategies.

3.2 Without front-ends

Table 2 describes the proposed algorithm for finding a load distribution for the case without front-ends in detail. This algorithm operates in three steps. In the first step, an initial solution is obtained by using (1) and (2), and the corresponding sender and receiver sets are formed as described in the previous section. Note that (1) and (2) are for the case with front-ends. In the second step, the feasibility of the resulting load distribution is validated. When all the resulting ΔL_j, j = 1, ..., m, are positive, the PPDD algorithm has obtained a feasible delimiter K and stops the iteration; it then obtains the final load distribution η using (9). In the last step, following the load distribution obtained above, we calculate the overall processing time of all the loads. Since the basic style of working is identical to the case with front-ends, we continue to refer to this algorithm simply as the Processor-set Partitioning and Data Distribution Algorithm (PPDD).

Table 2 PPDD algorithm for the case without front-ends

Initial phase: From (1) and (2), we obtain the initial delimiter K which identifies the potential sender and receiver sets.

Iteration phase: We assume that a processor p_i is a sender if i > K, or a receiver if i ≤ K. Assuming all processors finish processing at the same time, denoted T_x(m), x = 1, ..., m, to achieve the optimal processing time we have

T_x(m) = L_i E_i + \Delta L_i (C_i + E_i), \quad i = 1, 2, \dots, K,   (3)
T_x(m) = L_j E_j + \Delta L_j (C_j - E_j), \quad j = K + 1, \dots, m   (4)

Thus, expressing all the ΔL_i in terms of ΔL_1, for the receiver set we obtain

\Delta L_i = f_i + g_i \Delta L_1, \quad i = 1, 2, \dots, K   (5)

where f_i = \frac{L_1 E_1 - L_i E_i}{C_i + E_i} and g_i = \frac{C_1 + E_1}{C_i + E_i}, i = 1, 2, ..., K. For the sender set, we have

\Delta L_i = f_i + g_i \Delta L_1, \quad i = K + 1, K + 2, \dots, m   (6)

where f_i = \frac{L_1 E_1 - L_i E_i}{C_i - E_i} and g_i = \frac{C_1 + E_1}{C_i - E_i}, i = K + 1, K + 2, ..., m. Since the sum of the loads transferred from the sender set and the sum of the loads received by the receiver set must be identical, we have

\sum_{i=1}^{K} \Delta L_i = \sum_{i=K+1}^{m} \Delta L_i   (7)

Thus, from (5), (6) and (7), the closed-form solution for ΔL_1 is given by

\Delta L_1 = \frac{\sum_{i=K+1}^{m} f_i - \sum_{i=1}^{K} f_i}{\sum_{i=1}^{K} g_i - \sum_{i=K+1}^{m} g_i}   (8)

Equations (5), (6) and (8) give the solution for ΔL_1, ΔL_2, ..., ΔL_m, which should all be non-negative. If any resulting ΔL_i is negative, we update the receiver and sender sets by moving p_{K+1} from S_over to S_under and increasing K by 1. We repeat the calculations given by (3) to (8) until all of ΔL_1, ΔL_2, ..., ΔL_m are non-negative.

Overall processing time: The loads assigned to the individual processors are given by

l_i = L_i + \Delta L_i = L_i + f_i + g_i \Delta L_1, \quad i = 1, \dots, K;
l_i = L_i - \Delta L_i = L_i - (f_i + g_i \Delta L_1), \quad i = K + 1, \dots, m   (9)

The finish time of processor p_i is given by

T_i(m) = L_i E_i + (f_i + g_i \Delta L_1)(C_i + E_i), \quad i = 1, 2, \dots, K;
T_i(m) = L_i E_i + (f_i + g_i \Delta L_1)(C_i - E_i), \quad i = K + 1, \dots, m   (10)

Thus, the overall processing time is T(m) = max{T_i(m)}, i = 1, 2, ..., m.
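The iteration phase of Table 2 translates directly into code. The sketch below is ours (0-based indices, with the initial K taken from the ideal-case partition); it solves (8) in closed form and enlarges the receiver set until every ΔL_i is non-negative.

def ppdd_without_frontends(E, C, L, K):
    m = len(E)
    while K < m:
        # f_i, g_i from (5) for receivers (index < K) and (6) for senders.
        f = [(L[0] * E[0] - L[i] * E[i]) / ((C[i] + E[i]) if i < K else (C[i] - E[i]))
             for i in range(m)]
        g = [(C[0] + E[0]) / ((C[i] + E[i]) if i < K else (C[i] - E[i]))
             for i in range(m)]
        dL1 = (sum(f[K:]) - sum(f[:K])) / (sum(g[:K]) - sum(g[K:]))  # (8)
        dL = [f[i] + g[i] * dL1 for i in range(m)]                   # (5), (6)
        if all(d >= 0 for d in dL):
            l = [L[i] + dL[i] if i < K else L[i] - dL[i] for i in range(m)]  # (9)
            T = [L[i] * E[i] + dL[i] * ((C[i] + E[i]) if i < K else (C[i] - E[i]))
                 for i in range(m)]                                   # (10)
            return l, dL, K, max(T)
        K += 1  # move p_{K+1} from the sender set to the receiver set
    raise ValueError("no feasible delimiter found")

Note that each pass recomputes f_i and g_i, because moving p_{K+1} into the receiver set changes the denominators in (5) and (6).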

It may be noted that, since the load distribution of the ideal case serves as a convenient starting point for the case without front-ends, the above load distribution strategy avoids iterating from the value K = 1 onwards. Further, the optimal load distribution for the ideal case will definitely identify a larger sender set than the optimal load scheduling for the case without front-ends, because all the communication can be overlapped with computation in the ideal case. Also, for the ideal case, we expect a smaller overall processing time than that of the case without front-ends. In addition, we initially designate as senders all the processors whose original processing times (L_i E_i) are greater than the expected overall processing time. This is our basic idea for judiciously using the initial results as the starting point of the algorithm. Thus, we increase K in a step-by-step fashion, shrinking the sender set, in order to find the ultimate feasible receiver and sender sets. The initial choice of K determines the number of iterations needed by the PPDD algorithm. From equations (3) through (8), we observe that the algorithm always guarantees that the resulting load distribution makes all the processors finish processing at the same time. We present several significant properties exhibited by the proposed strategy below.

Lemma 1 In the case without front-ends, whenever the loads are not balanced, i.e., L_i E_i ≠ L_j E_j for some i ≠ j, there is at least one sender and one receiver, and ΔL_i > 0 for all receivers p_i ∈ S_under.

Lemma 2 The load distribution strategy takes a finite number of steps to converge. It needs only n < (m − K) iterations to obtain a near-optimal solution.

The proofs of the above lemmas can be found in the Appendix. Note that the above two lemmas also hold for the case with front-ends. All the properties presented in the above lemmas lead to the following conjecture.

Conjecture 1 The load distribution strategies presented in Tables 1 and 2 yield optimal solutions for the cases with and without front-ends, respectively.

A rigorous proof can be attempted following the treatment presented in [6]; the basic idea is to argue by contradiction if the solution given by the PPDD algorithm is not followed. Due to the uncertainty of the processing speed distribution in heterogeneous systems, we have not derived a satisfactory rigorous proof of this conjecture; our ongoing work is to derive a proof of statistical optimality for the PPDD algorithm. However, we observe from the workings of the algorithms that any re-distribution away from the load scheduling proposed above causes an imbalance of the loads among the processors and results in under-utilizing certain processors, since some processors may be busy processing while others have finished their tasks, thus increasing the overall processing time. Based on the above proofs and observations, we argue that the PPDD algorithm yields near-optimal solutions.

To see the working steps of this algorithm, we present a numerical example with the following speed parameters. This is for the case without front-ends. Note that, since Examples 1 to 4 are based on numerical analysis, the results are stable and deterministic. These examples are used to verify our theoretical analysis and to demonstrate certain features of our proposed algorithms more vividly. The ranges of the parameters (normalized processor and link speeds) and the computation-to-communication ratios used in these examples follow the observations and guidelines in [2, 40].

Example 1 In this example, we consider a single-level tree network with m = 5 processors and a root node (central scheduler). The system parameters are set as follows: processor speeds E_1 = 50 sec/MB, E_2 = 65 sec/MB, E_3 = 60 sec/MB, E_4 = 45 sec/MB, E_5 = 80 sec/MB, and link speeds C_1 = 0.3 sec/MB, C_2 = 0.2 sec/MB, C_3 = 0.15 sec/MB, C_4 = 0.1 sec/MB, C_5 = 0.55 sec/MB. These parameters are typical for image processing applications [13, 41]. The sizes of the respective loads injected at the processors are L_1 = 100 MB, L_2 = 110 MB, L_3 = 120 MB, L_4 = 180 MB, L_5 = 150 MB. We index the processors in the order of smallest L_i E_i first, as mentioned before. Note that the original processing times at the processors (calculated using L_i E_i, i = 1, ..., 5) are 5000, 7150, 7200, 8100, and 12000 sec, respectively, in increasing order. Thus, if each processor processes its own load without sharing with other processors, the overall processing time of the entire set of loads is 12000 sec and the average processing time is 7890 sec.

Using (2), we first obtain the ideal scheduling solution as follows. The sender and receiver set delimiter is K = 3; hence, the sender set is S_over = {p_4, p_5} and the receiver set is S_under = {p_1, p_2, p_3}. The amounts of load migration are ΔL_1 = 52.12, ΔL_2 = 7.02, ΔL_3 = 6.77, ΔL_4 = 10.97, ΔL_5 = 54.92, respectively. In the ideal case, the resulting schedule makes all the processors finish processing at the same time instant, and the overall processing time is 7606.0 sec.

Following the PPDD algorithm for the case without front-ends presented above, using K = 3 as the initial starting point, after one iteration we obtain the scheduling solution as follows. The sender and receiver delimiter is still K = 3; thus the sender set is S_over = {p_4, p_5} and the receiver

set is S_under = {p_1, p_2, p_3}. The amounts of load exchanged are ΔL_1 = 51.98, ΔL_2 = 7.13, ΔL_3 = 6.89, ΔL_4 = 10.81, ΔL_5 = 55.19, respectively. All processors finish processing at the same time, and the overall finish time of the entire set of loads is 7614.7 sec.

From the above example, we observe that the resulting overall processing time for the case without front-ends is quite close to that of the ideal case and is much less than the original processing time without load sharing (a reduction of about 36.5%). In addition, the near-optimal finish time obtained, 7614.7 sec, is even better than the average of the original individual processing times of the respective loads, 7890 sec. These results clearly elicit the fact that any naive strategy which aims to achieve an average processing time, or which assigns equal-size portions among the processors (the average of all the loads), will not produce a good solution in heterogeneous computing networks.

3.3 Homogeneous systems without front-ends

To gain more insight into the properties of the proposed algorithms, we further analyze homogeneous systems. Due to the irregular (sometimes random) parameters of heterogeneous systems, it is difficult to observe the natural trends of the performance and load distribution under the PPDD algorithm. Homogeneous settings offer an opportunity to examine some special behaviors of such systems, and findings for homogeneous systems can be used as a reference or approximation for similar heterogeneous systems. For a homogeneous system, we have C_i = C, i = 1, 2, ..., m, and E_i = E, i = 1, 2, ..., m. In this case, we observe some interesting special properties exhibited by the load partitioning strategy.

Lemma 3 In a homogeneous system, in the near-optimal load distribution obtained using the proposed strategy, we always have ΔL_i ≥ ΔL_{i+1}, i = 1, 2, ..., K − 1, for the receiver set, and ΔL_i ≤ ΔL_{i+1}, i = K + 1, ..., m − 1, for the sender set.

The proof of this lemma is presented in the Appendix. From Lemma 3, we observe that the proposed strategy wisely balances the loads among all the processors: it pulls more load from the heavily loaded processors and pushes more load to the more lightly loaded processors. A similar behavior of the proposed strategy can also be observed in heterogeneous computing systems.

Example 2 In this numerical example, we consider a homogeneous single-level tree network with m = 10 processing nodes. We set the processor and link speed parameters as E = 10 sec/MB and C = 1 sec/MB, respectively. The sizes of the loads originating at the processors are {L} = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100} MB, respectively. Following the PPDD algorithm, we obtain the receiver set {p_1, p_2, p_3, p_4, p_5}, the final load distribution {l} = {53.18, 54.09, 55.0, 55.91, 56.82, 57.22, 56.11, 55.0, 53.89, 52.78} MB, and {ΔL} = {43.18, 34.09, 25.0, 15.91, 6.82, 2.78, 13.89, 25.0, 36.11, 47.22} MB, respectively. In addition, all processors finish processing at the same time, given by T(m) = 575.0 sec. Without load re-distribution, the overall processing time is 1000 sec.

Fig. 2 Load distributions of Example 2

The load distribution is illustrated in Fig. 2. From this figure, we observe that in the final load distribution the loads are almost equally distributed for this homogeneous system. In this example, due to the non-negligible communication delays incurred in the load communication phase, the individual loads assigned to the processors are not identical in size. Since the minimum communication delays occur at p_5 and p_6 (ΔL_5 = 6.82, ΔL_6 = 2.78), they process larger amounts of load than the other processors. Further, we observe that the distribution of the exchanged loads ΔL_i completely adheres to the statement of Lemma 3.
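Example 2 can be reproduced with the sketch given after Table 2; the initial delimiter K = 5 comes from the ideal case, in which l_i = 55 MB for every processor. (An illustrative check, not the authors' code.)

E = [10.0] * 10
C = [1.0] * 10
L = [10.0 * (i + 1) for i in range(10)]      # 10, 20, ..., 100 MB
l, dL, K, T = ppdd_without_frontends(E, C, L, K=5)
print(round(T, 1))                           # 575.0 sec, all finish together
print([round(d, 2) for d in dL])             # decreasing over receivers,
                                             # increasing over senders (Lemma 3)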

One parameter that is often important in the study of load distribution problems in network-based environments is the ratio of the communication delay to the computation delay. To see the effect of various communication-to-computation ratios, we consider the system used in Example 2 for different communication speeds, C = 1, 2, ..., 6, keeping all other parameters the same. We denote the communication-to-computation speed ratio as δ = C/E, so that δ = 0.1, 0.2, ..., 0.6. Since we consider computation-intensive applications, the communication-computation ratios are typically much less than 0.5 [40]. The effect of δ can be understood by observing the variation of the exchanged loads among the processors.

Fig. 3 Exchanged load distributions for various communication to computation ratios (δ = 0.1 to 0.6)

The resulting exchanged-load distributions are illustrated in Fig. 3 for various values of δ. From this figure, we observe that the basic tendency of the exchanged-load distributions (ΔL_i) is a V-shaped curve; that is, the amount of load exchanged first decreases and then increases, for any value of δ. However, with increasing δ, the receiver set grows from 5 processors at δ = 0.1 to 7 processors at δ = 0.6, as shown in Fig. 3. According to Lemma 3, we know that ΔL_i > ΔL_{i+1} in the receiver set and ΔL_i < ΔL_{i+1} in the sender set. As a result, more load is transferred from the last processor p_m when δ is larger.

Fig. 4 Final load distributions for various communication to computation ratios (δ = 0.1 to 0.6)

From Fig. 4, we observe that the final load distribution for δ = 0.1 is well balanced among all the processors. As the communication-to-computation ratio increases, in this case (without front-ends), the last processor p_10 is engaged in communication for a large amount of time rather than in processing, which results in a smaller workload assignment.

Fig. 5 Overall processing times for various communication to computation ratios (δ = 0.02 to 0.6)

In Fig. 5, we see that the overall processing time increases monotonically as δ increases, as expected.

4 Load communication strategies

We shall now propose a load communication strategy to efficiently implement the load balancing among all the processors using the near-optimal load distribution obtained by the PPDD strategy in the previous section. It may be noted that when one attempts to distribute the optimal load fractions between the processors for balancing, some processors may go idle, and the resulting overall processing time of the entire set of loads may be greater than the near-optimal solution proposed by PPDD in the previous section. Thus, we need a communication strategy which carefully accounts for the communication delays while implementing the PPDD strategy (balancing the load fractions). The scheduler p_0 therefore first obtains the near-optimal fractions using PPDD, and these serve as the input to the communication strategy. In the following, we first describe the communication strategy for the case without front-ends, as this needs a systematic treatment; for the with front-ends case, we can then design the strategy using the procedure carried out for the without front-ends case.

Without loss of generality, let us assume that p_i ∈ S_under, i = 1, 2, ..., K, and p_j ∈ S_over, j = K + 1, K + 2, ..., m, where K is the delimiter of the sender and receiver sets. The load redistribution process is described as follows. Initially, p_i holds L_i units of load. We shall redistribute the extra loads ΔL_j from S_over to S_under. However, a processor p_i in S_under will accept only ΔL_i. Thus, the senders and receivers are not sending and receiving identical amounts of load, and we need to devise a strategy which carries out the redistribution process so as to minimize the overall processing time. Table 3 presents the details of the redistribution procedure and explains the various stages involved in the communication process. Note that the senders transfer load to the central scheduler p_0, and the scheduler routes the load to the respective receivers. The transferred load sizes are determined by (5) through (8). Further, it may be observed that at any time instant there is only one active sender and one active receiver. In the load communication strategy shown in Table 3, we denote the communication time slots of senders with a superscript s and the communication time slots of receivers with a superscript r.

Table 3 Load communication strategy for the case without front-ends

Initial stage: Initially, processor p_i has load L_i, i = 1, 2, ..., m. The sender set and the receiver set initiate communication at the same time. The first sender is p_m and the first receiver is p_1.

Load communication stage:
Sender part: Processors p_j ∈ S_over, j = K + 1, ..., m, send their extra loads ΔL_j, obtained from (5) through (8), to the scheduler p_0 in the reverse order of the processor index (from p_m down to p_{K+1}). At the beginning, processor p_m starts communication at t_0^s = 0 and stops at t_1^s. The communication time slot of p_j is [t_{m−j}^s, t_{m−j+1}^s], where t_{m−j}^s is the time instant at which p_j starts transferring its extra load and t_{m−j+1}^s is the time instant at which p_j stops the communication. It is given by

t_{m-j+1}^s = t_{m-j}^s + \Delta L_j C_j, \quad j = m, m-1, \dots, K+1   (11)

Thus, the finish time of processor p_j is given by

T_j(m) = \max\{(L_j - \Delta L_j) E_j + \Delta L_j C_j,\; t_{m-j+1}^s\}   (12)

Receiver part: Processors p_i ∈ S_under, i = 1, 2, ..., K, receive the loads ΔL_i, obtained from (5) through (8), from the scheduler p_0 in the order of the processor index, from p_1, p_2, ..., to p_K. At time zero, t_0^r = 0, p_1 starts to receive the load transferred from the sender set through the central scheduler and ends communication at time instant t_1^r. For p_i, i = 1, 2, ..., K, the communication time slot is [t_{i−1}^r, t_i^r], where t_i^r is given by

t_i^r = t_{i-1}^r + \Delta L_i C_i  for C_i ≥ C_j (sender is faster);  t_i^r = t_{i-1}^r + \Delta L_i C_j  for C_i < C_j (receiver is faster)   (13)

Hence, the finish time of p_i is given by

T_i(m) = \max\{(L_i + \Delta L_i) E_i + \Delta L_i C_i,\; t_i^r + \Delta L_i E_i\}   (14)

Note that in the above equation the term ΔL_i C_i may be substituted by ΔL_i C_j if the corresponding sender is slower than the receiver during this communication session.

Overall processing time: T(m) = max{T_i(m), T_j(m)}, i = 1, 2, ..., K, j = K + 1, K + 2, ..., m.

As mentioned in the algorithm, because of the heterogeneous communication speeds, we shall calculate the communication time taken by the receivers to receive the loads from the respective senders. Since the scheduler works as a router, as per our earlier assumption, the communication time for a receiver is determined by the slower of the links connecting the active sender and the active receiver during the communication session. Note that there is only one active sender and one active receiver at any instant in time.
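The receiver-side clock of (13) can be simulated with a short routine. The sketch below is our illustration of Table 3, under the same slower-link assumption: it drains the senders in reverse index order and returns each receiver's communication end time t_i^r.

def communication_schedule(C, dL, K):
    # Senders p_m .. p_{K+1} push dL_j via the scheduler; receivers p_1 .. p_K
    # pull in index order. Each transfer is paced by the slower of the two links.
    m = len(C)
    senders = [[j, dL[j]] for j in range(m - 1, K - 1, -1)]
    t, t_recv = 0.0, []
    for i in range(K):
        need = dL[i]
        while need > 1e-12:
            j, left = senders[0]
            x = min(need, left)
            t += x * max(C[i], C[j])        # equation (13)
            need -= x
            senders[0][1] -= x
            if senders[0][1] <= 1e-12:
                senders.pop(0)              # this sender is drained
        t_recv.append(t)
    return t_recv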

However, in the proposed strategy, at different time instants a sender may cater to more than one receiver, and a receiver may receive loads from more than one sender. Further, the communication time of the sender part is governed solely by the sender's own link speed, while the communication time of the receiver set is determined not only by the receiver's own link speed but also by the sender's link speed during that communication session.

Now, for the case with front-ends, we follow the same load communication procedure explained above. However, with front-ends, we modify (12) and (14) as

T_j(m) = \max\{(L_j - \Delta L_j) E_j,\; t_{m-j+1}^s\},   (15)
T_i(m) = \max\{(L_i + \Delta L_i) E_i,\; t_i^r + \Delta L_i E_i\}   (16)

Let us now demonstrate the strategies via a detailed numerical analysis.

Example 3 We continue with the system used in Example 1 in Sect. 3 for this case study. Using the proposed load communication strategy, we obtain the following results. The finish times of the processors are T_1(m) = 7627.7, T_2(m) = 7615.8, T_3(m) = 7614.7, T_4(m) = 7614.7, T_5(m) = 7614.7 sec. Thus, the overall finish time of the entire set of loads is T(m) = 7627.7 sec, which is approximately 0.17% more than the overall processing time obtained without considering the load communication stage. The load communication process on each processor is illustrated in Fig. 6 (not in exact proportion). The results of Example 1 imply that the receiver set is S_under = {p_1, p_2, p_3} and the sender set is S_over = {p_4, p_5}.

From the results of Example 3, we observe that all the processors in the sender set finish processing at the same time, while the processors in the receiver set may not finish processing at the same time. This is due to the fact that the communication times of the senders are determined only by their own channel speeds, whereas the communication time of the receiver part is determined by the slower link between the active sender and receiver, thus causing longer communication delays.

Fig. 6 Timing diagram for Example 3. We have 5 processors in total; 2 processors (p_4 and p_5) play the sender role and 3 processors (p_1, p_2 and p_3) are in the receiver set. We observe that the senders all stop at the same time while the receivers do not stop at the same time

Figure 6 illustrates the load communication process for Example 3. Above the time axes, we show the communication process using shaded blocks, and below the time axes, we show the computation process using blank blocks. In the group of senders, p_5 starts sending its load at time zero and stops its load communication at time instant 30.36 sec. During this time, p_4 is processing its own load independently. Immediately following p_5, p_4 starts communication at 30.36 sec, which lasts only 1.08 sec. In the group of receivers, p_1 receives its assigned load from time zero to the time instant 28.59 sec. During this time period, p_2 and p_3 are processing their own loads. Following p_1, p_2 continues the load communication, and then p_3 starts communication following p_2. When a processor is not engaged in load communication, it processes its available load independently, as clearly shown in Fig. 6. From the results obtained, we also observe that the difference between the actual overall processing time and the near-optimal finish time obtained using PPDD (in the previous section) is approximately 0.17%. Thus, the proposed strategy is shown to be efficient and close to the optimal solution.
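Combining the two sketches above reproduces Example 3 (our illustration; without front-ends, a receiver computes whenever it is not receiving, and the receivers are served back-to-back):

E = [50.0, 65.0, 60.0, 45.0, 80.0]
C = [0.3, 0.2, 0.15, 0.1, 0.55]
L = [100.0, 110.0, 120.0, 180.0, 150.0]
l, dL, K, T_ppdd = ppdd_without_frontends(E, C, L, K=3)
t_r = communication_schedule(C, dL, K)
# Finish time of receiver p_i: its total compute time plus its own receive window.
T_recv = [l[i] * E[i] + (t_r[i] - (t_r[i - 1] if i else 0.0)) for i in range(K)]
# T_recv ~ [7627.7, 7615.8, 7614.7] sec; senders finish at ~T_ppdd ~ 7614.7 sec.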
5 Discussions of the results

The contributions in this paper are novel to the DLT literature. The paper addresses a realistic situation in a distributed network of processors wherein computational loads can originate at any processor on the network. Thus, when there is more than one load to be processed in the system, unless a clever strategy for load distribution is employed, the processors may not be efficiently utilized. The existing DLT literature [32] addresses the processing of multiple loads on distributed networks, but it assumes that all the loads originate at the central scheduler (the bus controller unit in the case of bus networks). Our formulation considers loads originating at different processors on the network. We proposed load distribution (PPDD) and communication strategies for the cases when the processors are and are not equipped with front-ends. For the case with front-ends, we simply use (2) to obtain the loads exchanged among the processors and hence the final load distribution l_i, i = 1, 2, ..., m. For the case without front-ends, we follow the steps proposed in Table 2. The PPDD algorithm takes advantage of the optimality principle to minimize the overall processing time. As proven in

Lemma 1 and Lemma 2, the PPDD algorithm is guaranteed to determine the near-optimal solution in a finite number of steps. Since the load partitioning phase does not account for the communication delays that are encountered during the actual load communication, we obtain the near-optimal solution immediately from the procedure described in Sect. 3. When we consider the actual load transfer to the processors, the PPDD algorithm is guaranteed to produce a near-optimal solution for homogeneous systems. However, in heterogeneous systems, (3) and (4) use only the communication speed of the corresponding receiving link, which causes imprecise results when the actual load transfer takes place, as clearly demonstrated in Examples 2 and 3. This minor discrepancy in the results is due to the fact that the communication speeds of the active sending and receiving links differ, whereas the actual load transfer time is determined solely by the slower link and not by the receiving link. One may relax this assumption and consider the combined effect of both link delays in the PPDD algorithm; however, the resulting solution may not be drastically different from what is proposed by PPDD under the current model. A significant advantage of PPDD in its current form is the simplicity of designing and implementing a scheduler at the root processor p_0.

However, one may attempt to use other strategies to schedule and transfer loads among processors to minimize the processing time. A natural choice is to modify the load scheduling strategies proposed in the literature [6, 32], in which scheduling strategies for multiple loads arriving at a bus controller unit (BCU) were studied. In this paper, by contrast, we consider the case where multiple loads originate at different sites. In any case, we can apply the previous scheduling strategies from the literature to our problem context. At first, we consider a single-level tree network with only one load originating at a processor. Using the following equations, we obtain a near-optimal load distribution for a single load. Then, we may repeat this procedure for all the loads residing at the other sites. For comparison purposes, we consider the case without front-ends in this section; for the case with front-ends, one may follow a similar procedure.

Fig. 7 A single-level tree network with a single load

Without loss of generality, we assume that processor p_1 has a load L, as shown in Fig. 7. The load distribution process is as follows. Processor p_1 partitions its load L into m portions, α_1 L, α_2 L, ..., α_m L. Then, p_1 distributes load fraction α_2 L to p_2 first, then α_3 L to p_3, ..., until α_m L to p_m, respectively, via p_0. After the load distribution, p_1 starts processing its own load α_1 L. The timing diagram for this load distribution process can be found in [6]. Thus, to balance the load among all the processors in such a way that they finish processing at the same time, we have

\alpha_1 L E_1 = \alpha_m L E_m,   (17)
\alpha_i L E_i = \alpha_{i+1} L (\max\{C_1, C_{i+1}\} + E_{i+1}), \quad i = 2, 3, \dots, m-1,   (18)
\sum_{i=1}^{m} \alpha_i = 1   (19)

where α_i, i = 1, 2, ..., m, is the fraction of the load L assigned to processor p_i. From Fig. 7, we observe that link l_1 is the common link for all communications between p_1 and p_i, i = 2, 3, ..., m. Because the communication time is determined by the slower link, we use max{C_1, C_{i+1}} in the above equations to obtain the (slower) link speed between links l_1 and l_{i+1}. From the above equations, we can obtain the individual α_i; hence, the near-optimal load fraction to be assigned to p_i is α_i L, i = 1, 2, ..., m.
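Equations (17)-(19) can be solved by propagating the recursion (18) from an unnormalized α_2, closing the chain with (17), and normalizing with (19). A minimal sketch, assuming the load originates at the first listed processor:

def single_load_fractions(E, C):
    # Fractions alpha_i for one load at p_1, distributed via p_0 (Fig. 7).
    m = len(E)
    a = [0.0] * m
    a[1] = 1.0                                # unnormalized alpha_2
    for i in range(1, m - 1):                 # (18): alpha_i E_i = alpha_{i+1}(max{C_1, C_{i+1}} + E_{i+1})
        a[i + 1] = a[i] * E[i] / (max(C[0], C[i + 1]) + E[i + 1])
    a[0] = a[m - 1] * E[m - 1] / E[0]         # (17): alpha_1 E_1 = alpha_m E_m
    total = sum(a)
    return [x / total for x in a]             # (19): fractions sum to 1

For the homogeneous system of Example 4 below (E = 10 sec/MB, C = 1 sec/MB, m = 5), this yields α ≈ {0.177, 0.236, 0.215, 0.195, 0.177}, matching the fractions listed in Table 5.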

Equations (17) through (19) provide a near-optimal solution for sharing one load among all the processors. We now apply this to a new strategy for sharing multiple loads, referred to as the Round-robin Scheduling Algorithm (RSA), described in Table 4. The RSA algorithm is an extension of the load scheduling strategy for a single load. Note that while a load is being scheduled among all the processors, the processors that are not engaged in communication can process their loads independently. In every iteration, in a step-by-step fashion, each processor distributes its load among all the processors. Further, each processor attempts to balance all the loads among all the processors such that they finish processing at the same time. To compare the time performance of the RSA and PPDD strategies, we present an example based on numerical analysis.

Table 4 Round-robin scheduling algorithm for the case without front-ends

Initial stage: Initially, processor p_i has load L_i, i = 1, 2, ..., m.
Iteration stage: From p_m to p_1, all processors share their loads step by step. At the first iteration, we schedule load L_m among all processors. After the communication of L_m, the remaining load on processor p_{m−1} is shared among all processors. At iteration i, p_{m−i+1} schedules its available load among all processors and lets them finish processing this load at the same time, using (17) to (19). Note that at each iteration we distribute the load in the order of link speed, faster links first. Also, note that there is no violation of Rule A [6].
Final solution: At the end of iteration m, we stop the load scheduling and let all processors process their own loads to the end. Thus, the overall processing time of all the loads is determined by the finish time of the processor that takes the maximum time to complete its processing.

Example 4 We consider a homogeneous single-level tree network with m = 5 processors. The speed parameters are set as follows: E_i = 10 sec/MB and C_i = 1 sec/MB, i = 1, 2, ..., 5. The original load status is L_1 = 10 MB, L_2 = 20 MB, L_3 = 30 MB, L_4 = 40 MB, L_5 = 50 MB. The working steps of RSA are shown in Table 5. In Table 5, for example, row 1 specifies the distribution process by processor p_5. Thus, starting from 0 sec, p_5, which has 50 MB (see column 2 of row 1), starts distributing its load to the rest of the processors as per the RSA algorithm, and the respective load fractions are shown in the last column. Similarly, the distribution process continues for the other processors. Using the RSA algorithm, we obtain the finish times T_1(m) = 256.3 sec, T_2(m) = 329.4 sec, T_3(m) = 385.4 sec, T_4(m) = 429.3 sec, T_5(m) = 463.9 sec. Thus, the overall processing time is T(m) = 463.9 sec. Because the system in this example is homogeneous, the load fractions α_i are the same at every iteration. Using the PPDD algorithm, in contrast, we obtain an overall processing time of T*(m) = 312.2 sec, with all processors finishing at the same time. As a reference, the optimal processing time in the ideal situation is 300 sec.

Table 5 Results of Example 4 using RSA

Iteration (i)       Sender and load    Sender's α       Receivers' α
1 at 0 sec          p_5, 50.000 MB     α_5 = 0.177      {α_1, α_2, α_3, α_4} = {0.236, 0.215, 0.195, 0.177}
2 at 41.136 sec     p_4, 45.636 MB     α_4 = 0.177      {α_1, α_2, α_3, α_5} = {0.236, 0.215, 0.195, 0.177}
3 at 78.682 sec     p_3, 42.646 MB     α_3 = 0.177      {α_1, α_2, α_4, α_5} = {0.236, 0.215, 0.195, 0.177}
4 at 113.768 sec    p_2, 41.251 MB     α_2 = 0.177      {α_1, α_3, α_4, α_5} = {0.236, 0.215, 0.195, 0.177}
5 at 147.706 sec    p_1, 41.827 MB     α_1 = 0.177      {α_2, α_3, α_4, α_5} = {0.236, 0.215, 0.195, 0.177}

All communication finishes at time 182.118 sec. At this time, the distribution of the remaining loads is {7.415, 14.728, 20.327, 24.714, 28.180} MB.
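One full RSA pass for Example 4 can be simulated as below, reusing single_load_fractions from the sketch above. The code encodes our reading of Table 4 (a processor computes whenever it is neither sending nor receiving; the distributor cannot compute while sending) and reproduces the loads and times in Table 5.

def rsa_example4():
    E, C, m = 10.0, 1.0, 5
    alpha = single_load_fractions([E] * m, [C] * m)  # [0.177, 0.236, 0.215, 0.195, 0.177]
    load = [10.0, 20.0, 30.0, 40.0, 50.0]
    t = 0.0
    for it in range(m):                              # distributors p_5, p_4, ..., p_1
        src = m - 1 - it
        dt = (1.0 - alpha[0]) * load[src] * C        # total send time this round
        others = [i for i in range(m) if i != src]
        for k, i in enumerate(others):               # receivers in index order
            recv = alpha[k + 1] * load[src]
            done = (dt - recv * C) / E               # amount computed while not receiving
            load[i] = max(0.0, load[i] - done) + recv
        load[src] *= alpha[0]                        # distributor keeps alpha_1 of its load
        t += dt
    finish = [t + x * E for x in load]
    return t, load, finish
# t ~ 182.1 sec; remaining loads ~ {7.4, 14.7, 20.3, 24.7, 28.2} MB;
# finish times ~ {256.3, 329.4, 385.4, 429.3, 463.9} sec, as in Example 4.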

14 44 Cluster Comput (2010) 13: RSA strategy, the last processor to distribute is p 1 and it is the first one to finish processing its load. The reason is that, at the last iteration, p 1 lets all other processors finish processing its remaining load at the same time while other processors have their own load during that time. At the end of the load communication phase, the remaining load at p 1 is MB (smallest), as shown in the last row of Table 5. Thus, all the other processors need more time to finish their loads than p 1 after the end of load communication. Anaturalimprovementistorepeatround-robinscheduling until the finish times of all processors are sufficiently close. But RSA cannot avoid any additional time delays (overhead) incurred due to shuttling of load from and to the same processor, i.e., a load fraction transferred from p i to p j in previous iterations may be transferred back to p i from p j or p k,thuswastingthecommunicationresources.since there is no front-end to overlap communication and computation, such kind of unnecessary load wandering greatly prolongs the overall processing time. On the other hand, RSA always needs m iterations to obtain the final solution while PPDD algorithm needs only (m K) iterations,as proven in Lemma 2. InExample4, RSAneeds5iterations while PPDD algorithm needs only one iteration to obtain a better solution. If we improve RSA through repeating the round-robin scheduling, RSA needs more iterations to obtain a better solution. However, even in this case, improved version of RSA cannot avoid load wandering (as discussed above) from and back to a processor either. 6Conclusions We have addressed the problem of scheduling strategies for divisible loads originating from multiple sites in single-level tree networks. The formulation presented a general scenario with multi-site divisible loads, demanding several processors to share their loads for processing. We have designed aloaddistributionstrategyandcommunicationstrategyto carry out the processing of all the loads submitted at various sites. A two phase approach is taken to attack the problem aloadpartitioningphaseandtheactualcommunicationof load fractions to the respective processors (communication strategy). In the first phase, we derive the near-optimal load distribution; in the second phase, we consider the actual communication delay in transferring the load fractions to the processors, by assuming that the overall delay is contributed by the slowest link between the sending and receiving processors. As a first step, one can relax this assumption and analyze the performance and the proposed scheduling strategies are flexible in adapting to such relaxed assumptions as mentioned in the discussions section. For cases with front-ends and without front-ends, we propose a scheduling strategy, PPDD algorithm, to achieve a near-optimal processing time of all loads. Several significant properties of PPDD algorithm are proven in lemmas and detailed analysis of time performance of PPDD algorithm was conducted. The above analysis is also extended to homogeneous systems wherein we have shown that the time performance of PPDD algorithm with respect to various communication-computation ratios. To implement the load distribution strategy obtained through PPDD algorithm, we proposed a simple load communication strategy. It was demonstrated that the overall processing time obtained using PPDD algorithm is sufficiently close to the result following the actual load communication strategy proposed. 
To further demonstrate the efficiency of the PPDD algorithm, we also compared its time performance with that of another algorithm, the Round-robin Scheduling Algorithm (RSA), which is most commonly used. It is shown that the proposed PPDD algorithm produces better scheduling solutions than RSA. Detailed discussions and comparisons are carried out. The proposed load scheduling strategies can be readily extended to other network topologies in a similar way. Another interesting extension is to further study the case with multiple load arrivals at each processor, which models dynamic scheduling scenarios in grid or cloud computing environments.

Acknowledgements  The authors would like to thank the editors and referees for their valuable suggestions, which have significantly helped improve the quality and presentation of this paper. The research presented in this paper is supported in part by US National Science Foundation grant CNS.

Appendix

Proof of Lemma 1  Since $L_1 E_1 \ge L_i E_i$ for $i \ge 2$, and $C_i < E_j$ for any $i, j \in \{1, \ldots, m\}$, as mentioned before, we have

$$f_i = \frac{L_1 E_1 - L_i E_i}{C_i + E_i} \ge 0, \quad i = 1, 2, \ldots, K, \qquad (12)$$

$$g_i = \frac{C_1 + E_1}{C_i + E_i} > 0, \quad i = 1, 2, \ldots, K, \qquad (13)$$

$$f_i = \frac{L_1 E_1 - L_i E_i}{C_i - E_i} \le 0, \quad i = K+1, K+2, \ldots, m, \qquad (14)$$

$$g_i = \frac{C_1 + E_1}{C_i - E_i} < 0, \quad i = K+1, K+2, \ldots, m. \qquad (15)$$

Since the loads are unbalanced, $f_i > 0$ for some $i$ and $f_i < 0$ for some $i$. From the closed-form expression of $\delta L_1$ given in (8) and using the above inequalities, we immediately see that $\delta L_1 > 0$. Similarly, by expressing all $\delta L_i$, $i = 1, 2, \ldots, m-1$, in terms of $\delta L_m$ and through algebraic manipulations similar to those used to prove $\delta L_1 > 0$, we can

also show that $\delta L_m > 0$. Therefore, there is always at least one sender and one receiver available. Now we shall prove the second part of the lemma. For the receivers $p_i$, $i = 1, \ldots, K$, at each iteration of the above load distribution strategy we naturally have $(L_1 + \delta L_1) E_1 + \delta L_1 C_1 > L_i E_i$, $i = 1, \ldots, K$, where the LHS is the new finish time of processor $p_1$ after load distribution and the RHS is the earlier finish time of processor $p_i$. This is because we attempt to balance the loads among all the processors by extending (stretching) the processing time of the processors in the receiver set and reducing (shrinking) the processing time of the processors in the sender set in such a way that they finish processing at the same time. Thus, from (5), we have

$$\delta L_i = \frac{(L_1 + \delta L_1) E_1 + \delta L_1 C_1 - L_i E_i}{C_i + E_i} > 0, \quad i = 2, \ldots, K. \qquad (16)$$

Hence the proof.

Proof of Lemma 2  This can be seen directly from an inherent property of the proposed load distribution strategy. Assume that we fail to determine a near-optimal solution satisfying the condition $\delta L_i > 0$, $i = 1, 2, \ldots, m$, in all previous iterations from $K$ to $m-1$. The last iteration cannot fail, since from Lemma 1 we have $\delta L_m > 0$, and all the other processors become potential receivers, which have $\delta L_i > 0$, $i = 1, 2, \ldots, m-1$. Hence the proof.

Proof of Lemma 3  For the receiver set, from (5), we have

$$\delta L_i = \delta L_{i+1} + \frac{L_{i+1} E - L_i E}{C + E}, \quad i = 1, 2, \ldots, K-1. \qquad (17)$$

Since $L_i E \ge L_{i+1} E$, $i = 1, 2, \ldots, K-1$, from the above equation we immediately obtain $\delta L_i \le \delta L_{i+1}$, $i = 1, 2, \ldots, K-1$. For the sender set, from (6), we have

$$\delta L_i = \delta L_{i+1} + \frac{L_{i+1} E - L_i E}{C - E}, \quad i = K+1, \ldots, m-1. \qquad (18)$$

Note that $(C - E) < 0$ in the above equation, as per the assumptions mentioned in Sect. 2. Since $L_i E \ge L_{i+1} E$, $i = K+1, \ldots, m-1$, from the above equation we obtain $\delta L_i \ge \delta L_{i+1}$, $i = K+1, \ldots, m-1$. Hence the proof.
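As a quick numeric sanity check of the monotonicity argument above, the following sketch evaluates the recursions in (17) and (18) for an assumed homogeneous system. The parameter values (E, C, the load vector, and the seed increments for $\delta L_K$ and $\delta L_m$) are hypothetical and chosen only so that the stated orderings can be observed; they are not values from the paper.

```python
# Hypothetical homogeneous parameters with C < E, as assumed in Sect. 2.
E, C = 1.0, 0.2                       # sec/MB compute and link rates (assumed)
K = 3                                 # receivers p_1..p_K, senders p_{K+1}..p_m
L = [50.0, 49.0, 48.0, 47.0, 46.0]    # loads sorted so that L_i * E >= L_{i+1} * E
m = len(L)

# Receiver recursion (17): dL_i = dL_{i+1} + (L_{i+1} - L_i) * E / (C + E)
dL_recv = [0.0] * K
dL_recv[K - 1] = 5.0                  # assumed seed value for dL_K
for i in range(K - 2, -1, -1):
    dL_recv[i] = dL_recv[i + 1] + (L[i + 1] - L[i]) * E / (C + E)

# Sender recursion (18): dL_i = dL_{i+1} + (L_{i+1} - L_i) * E / (C - E)
dL_send = [0.0] * m
dL_send[m - 1] = 4.0                  # assumed seed value for dL_m
for i in range(m - 2, K - 1, -1):
    dL_send[i] = dL_send[i + 1] + (L[i + 1] - L[i]) * E / (C - E)

# Lemma 3 as reconstructed: increments are non-decreasing over the receivers
# and non-increasing over the senders.
assert all(dL_recv[i] <= dL_recv[i + 1] for i in range(K - 1))
assert all(dL_send[i] >= dL_send[i + 1] for i in range(K, m - 1))
print("receiver dL:", [round(x, 3) for x in dL_recv])
print("sender   dL:", [round(x, 3) for x in dL_send[K:]])
```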

Xiaolin Li is an Assistant Professor in the Computer Science Department at Oklahoma State University. His research interests include Parallel and Distributed Systems, Cyber-Physical Systems, and Network Security.
His research has been sponsored by several external grants, including US National Science Foundation (NSF) grants (PetaApps, GENI, CRI, and MRI programs), the Department of Homeland Security (DHS), the Oklahoma Center for the Advancement of Science and Technology (OCAST), the Oklahoma Transportation Center (OTC), and industry partners. He is an associate editor of three international journals and a program chair for over 10 international conferences and workshops. He is on the executive committee of the IEEE Technical Committee on Scalable Computing (TCSC) and a panelist for NSF. He has been a TPC member for numerous international conferences, including INFOCOM, GlobeCom, ICC, CCGrid, MASS, and ICPADS. He received the Ph.D. degree in Communications and Information Engineering from the National University of Singapore, Singapore, and the Ph.D. degree in Computer Engineering from Rutgers University, USA. He directs the Scalable Software Systems Laboratory. He is a member of IEEE and ACM.

Bharadwaj Veeravalli received his B.Sc. in Physics from Madurai-Kamaraj University, India, in 1987, his Master's in Electrical Communication Engineering from the Indian Institute of Science, Bangalore, India, in 1991, and his Ph.D. from the Department of Aerospace Engineering, Indian Institute of Science, Bangalore, India. He did his post-doctoral research in the Department of Computer Science, Concordia University, Montreal, Canada. He is with the Department of Electrical and Computer Engineering, Communications and Information Engineering (CIE) division, at The National University of Singapore, as a tenured Associate Professor. His mainstream research interests include multiprocessor systems, cluster/grid/cloud computing, scheduling in parallel and distributed systems, bioinformatics and computational biology, and multimedia computing. He is one of the earliest researchers in the field of divisible load theory (DLT). He has published over 65 papers in high-quality international journals and conferences. He has successfully secured several externally funded projects and has co-authored three research monographs in the areas of PDS, Distributed Databases (competitive algorithms), and Networked Multimedia Systems, in the years 1996, 2003, and 2005, respectively. He has guest-edited a special issue on Cluster/Grid Computing for the IJCA, USA, journal. He has served as a program committee member and as a session chair in several international conferences. He is currently serving on the editorial boards of IEEE Transactions on Computers, IEEE Transactions on SMC-A, and Multimedia Tools & Applications (MTAP), USA, as an Associate Editor. He is a Senior Member of IEEE & IEEE-CS. Bharadwaj Veeravalli's complete academic career profile is available online.
