PPDD: scheduling multi-site divisible loads in single-level tree networks


Xiaolin Li · Bharadwaj Veeravalli

Received: 2 July 2007 / Accepted: 1 September 2009 / Published online: 1 October 2009
© Springer Science+Business Media, LLC 2009

Abstract This paper investigates scheduling strategies for divisible jobs/loads originating from multiple sites in hierarchical networks with heterogeneous processors and communication channels. In contrast, most previous work in the divisible load scheduling theory (DLT) literature mainly addressed scheduling problems with loads originating from a single processor. This is one of the first works to address scheduling multiple loads from multiple sites in the DLT paradigm. In addition, scheduling multi-site jobs is common in Grids and other general distributed systems for resource sharing and coordination. An efficient static scheduling algorithm, PPDD (Processor-set Partitioning and Data Distribution Algorithm), is proposed to near-optimally distribute multiple loads among all processors so that the overall processing time of all jobs is minimized. The PPDD algorithm is applied to two cases: when processors are equipped with front-ends and when they are not. The application of the algorithm to homogeneous systems is also studied. Further, several important properties exhibited by the PPDD algorithm are proven through lemmas. To implement the PPDD algorithm, we propose a communication strategy. In addition, we compare the performance of the PPDD algorithm with a Round-robin Scheduling Algorithm (RSA), which is most commonly used. Extensive case studies through numerical analysis have been conducted to verify the theoretical findings.

Keywords Divisible load theory · Heterogeneous computing · Load scheduling · Grid computing · Single-level tree networks

X. Li (corresponding author), Computer Science Department, Oklahoma State University, 219 MSCS, Stillwater, OK 74078, USA. E-mail: xiaolin@cs.okstate.edu

B. Veeravalli, Department of Electrical and Computer Engineering, The National University of Singapore, 4 Engineering Drive 3, Singapore, Republic of Singapore. E-mail: elebv@nus.edu.sg

1 Introduction

Parallel and distributed heterogeneous computing has become an efficient solution methodology for various real-world applications in science, engineering, and business [1-4]. One of the key issues is how to partition and schedule the jobs/loads that arrive at processing nodes among the available system resources so that the best performance is achieved with respect to the finish time of all input tasks. To utilize the computing resources efficiently, researchers have contributed a large number of load/task scheduling and balancing strategies in the literature [1, 5-7]. Recent efforts have focused on resource sharing and coordination across multi-site resources (e.g., multiple supercomputer centers or virtual organizations).

For divisible load scheduling problems, research since 1988 has established that the optimal workload allocation and scheduling to processors and links can be solved through the use of a very tractable linear model formulation, referred to as Divisible Load Theory (DLT) [6]. DLT features easy computation, a schematic language, equivalent network element modeling, results for infinite-sized networks, and numerous applications. This theoretical formulation opens up attractive modeling possibilities for systems incorporating communication and computation issues, as in parallel, distributed, and Grid environments. Here, the optimality, involving solution time and speedup, is derived in the context of a specific scheduling policy and interconnection topology. The formulation usually generates optimal solutions via a set of linear recursive equations. In simpler models, recursive algebra also produces optimal solutions. The model takes into account the heterogeneity of processor and link speeds as well as relative computation and communication intensity. DLT can model a wide variety of approaches with respect to load distribution (sequential or concurrent), communications (store-and-forward and virtual cut-through switching), and hardware availability (presence or absence of front-end processors). Front-end processors allow a processor to communicate and compute simultaneously by assuming the communication duties. A recent survey of DLT research can be found in [8]. The DLT paradigm has proven to be remarkably flexible in handling a wide range of applications.

1.1 Related work

Since the early days of DLT research, the field has spanned from addressing general optimal scheduling problems on different network topologies to various scenarios with practical constraints, such as time-varying channels [9], minimizing cost factors [10], resource management in Grid environments [11, 12], and distributed image processing [13]. Thorough surveys of DLT can be found in [5, 6, 14, 15]. Load partitioning of intensive computations of large matrix-vector products in a multicast bus network was theoretically investigated in [16]. Research efforts after 1996 particularly started focusing on practical issues such as scheduling multiple divisible loads [17], scheduling divisible loads with arbitrary processor release times in linear networks [18], consideration of communication startup time [19, 20], and buffer constraints [21]. Some of the proposed algorithms were tested through experiments on real-life application problems such as image processing [13], matrix-vector product computations [22], and database operations [23]. Various experimental works have been done using the divisible load paradigm, such as [22] for matrix-vector computation on PC clusters and [23] for other applications on a network of workstations (NOWs). Recent work in DLT also attempted to use adaptive techniques when computation needs to be performed under unknown speeds of the nodes and the links [24]; this study used bus networks as the underlying topology. Beaumont et al. consolidate the results for single-level tree and bus topologies and present extensive discussions on some open problems in this domain [25]. A few new applications and solutions in DLT have been investigated in recent years, e.g., bioinformatics [26], multimedia streaming [27], sensor networks [28, 29], and economic and game-theoretic approaches [30, 31].

Although most of the contributions in the DLT literature consider only a single load originating at one processor [14, 15], scheduling multiple loads has been considered in [32] and [17]. Work presented in [33] considers processing divisible loads originating from an arbitrary site on an arbitrary graph. However, these works considered merely a single-site multi-load scheduling problem and did not address multiple loads originating at arbitrary multiple sites/nodes in networks. Multi-site multi-load scheduling is a practical situation, e.g., multiple jobs submitted to multiple sites in Grids. The point of load origination does impose a significant influence on the performance.
In addition, when one considers multiple loads originating from several nodes/sites, it becomes much more challenging to design efficient scheduling strategies. One paper relevant to the context of the problem addressed in our work is [34]. This study investigated load scheduling and migration problems without synchronization delays in a bus network by assuming that all processors have front-ends and that the communication channel can be dynamically partitioned. Front-ends are communication co-processors that handle communication without involving the processors, so that communication and computation can be fully overlapped and concurrent [6]. In this case, load distribution without any consideration of synchronization delay is quite straightforward, as will be shown later. However, in practice, it would be unreasonable to assume that the channel can be dynamically partitioned. In addition, we shall also consider the case when processors are not equipped with front-ends. Especially in distributed sensor systems, front-end modules may be absent from the processing elements [6]. Recently, [35] investigated the case of two load origination sources in a linear daisy-chain architecture. Divisible load scheduling problems with multiple load sources in Grid environments have been studied in [11, 36].

In this paper, we consider a general load scheduling and balancing problem with multiple loads originating from multiple processors in a single-level tree network. This scenario occurs commonly in realistic situations, such as applications in distributed real-time systems, collaborative grid systems (where each virtual organization can be abstracted as a resource site or a local hierarchical network), and general load balancing and sharing applications [1, 2, 37]. In Grid environments, our proposed model can be applied to the following scenario: we have a super-scheduler across multiple sites and local schedulers for each site; multiple jobs are submitted to the local schedulers and possibly partitioned and migrated across multiple sites by the super-scheduler for resource sharing, load balancing, and high performance/throughput.

1.2 Our contributions

The contributions in this paper are as follows. The primary motivation of this work stems from the fact that, in a real-world scenario, there could be multiple loads submitted for processing on networks originating from several geographically distributed sites, such as in Grid computing environments [1]. While multiple-loads processing has been studied in the DLT literature [17, 32], these studies focus on bus networks and assume that all the loads are available at the root (bus-controller unit) a priori. We regard these as single-site multiple-jobs problems. The study in this paper is different in formulation and attempts to provide a generalized framework. We formulate the load scheduling problem with multiple loads originating from multiple sites in single-level tree networks.¹ For the cases with and without front-ends, we design a scheduling strategy, referred to as the Processor-set Partitioning and Data Distribution Algorithm (PPDD), to achieve a near-optimal processing time for all loads. Several significant properties of the PPDD algorithm are proven in lemmas. A detailed analysis of the time performance of the PPDD algorithm is conducted. In order to actually implement the load distribution obtained through the PPDD algorithm, we propose a load communication strategy. In addition, we compare the time performance of the PPDD algorithm with another algorithm, referred to as the Round-robin Scheduling Algorithm (RSA). It is demonstrated that the proposed PPDD algorithm produces better scheduling solutions than RSA. We verify all these findings via detailed numerical examples on heterogeneous systems of processors. The contributions in this paper are expected to spur further research in this direction, especially when considering scheduling loads on arbitrary networks from multiple sites.

This paper is organized as follows. We first formulate the problem and present some notations in Sect. 2. In Sect. 3, we consider load partitioning strategies for the cases with and without front-ends. Then, we present the communication strategies for these cases in Sect. 4. We prove some important results to analyze the performance of the algorithms. In Sect. 5, we discuss in detail and compare the time performance of the PPDD algorithm and RSA. Section 6 concludes the paper and presents some possible extensions to this work.

¹ It may be noted that our formulation holds for a bus network topology, which is a special case of a single-level tree network.

2 Problem formulation and some notations

This section first introduces the network architecture and then presents the definitions, notations, and terminology to be used throughout the paper.

Fig. 1 A single-level tree network with multiple loads

As shown in Fig. 1, we consider a single-level tree network with a root processor p_0, also referred to as the scheduler for the system, and m processors denoted p_1, ..., p_m, connected via links l_1, ..., l_m, respectively. We assume that the scheduler is only in charge of collecting the load status of each processor and routing the loads from one processor to another, and that it does not participate in processing any load. In other words, p_0 works like a router. Initially, each processor is assumed to have a load to be processed. The objective is to minimize the overall processing time of all the loads submitted to the system (at the various processors). If we do not schedule each of the loads among the set of processors, then the overall processing time of all the loads is determined by the time when the last processor finishes processing its own load. In order to minimize the overall finish time, we should carefully re-schedule and balance the loads among all processors.
Also, the scheduling strategy must be such that a faster processor processes more load while a slower processor processes less load. Since the processors, the links, and the sizes of the loads originating at the various processors are heterogeneous (non-identical), obtaining an optimal solution becomes a complex problem. In the load balancing literature [38], the basic rationale is to balance the loads in such a way that some load fractions from over-loaded processors are transferred to under-loaded processors, so that all the processors have more-or-less identical processing times for the loads assigned to them. Here too, we follow the same strategy as the basic mechanism for balancing the divisible loads among the processors. We introduce some notations and terminology that will be used throughout the paper as follows.

E_i: The time it takes to compute a unit load at processor p_i, i = 1, ..., m.
C_i: The time it takes to transmit a unit load on link l_i, i = 1, ..., m.
L_i: The amount of load originating at p_i for processing, as shown in Fig. 1.
l_i: The load assigned to p_i according to a scheduling strategy.
η: The load distribution obtained. This is defined as an m-tuple denoting the loads assigned to each p_i, and is given by η = {l_1, l_2, ..., l_m}. Certainly, the sum of the l_i must equal the sum of the original loads, that is, L = \sum_{i=1}^{m} l_i = \sum_{i=1}^{m} L_i.
ΔL_i: The load portion to be transferred from or to a processor p_i, i = 1, 2, ..., m.
T_i(m): The finish time for processing the load at p_i.
T(m): The overall processing time for all the loads processed by the m processors. This is given by T(m) = \max_{i=1,\dots,m} T_i(m).
T*(m): The optimal processing time of all the loads.
S_over: The set of processors which are over-loaded. Processors in this set are the potential senders of excess loads.
S_under: The set of processors which are under-loaded. Processors in this set are the potential receivers of loads transferred from the processors in S_over.

In our formulation, we consider a single-level tree network with m processors and a scheduler p_0. Each processor p_i has its own divisible load of size L_i to process, and the goal is to design an efficient scheduling strategy to minimize the overall processing time of all the loads (on all the processors) by partitioning and distributing the loads among all the m processors. Note that the proposed scheduling strategy also handles the situation in which only a subset of the processors have loads to process. Note that T_i(m), the processing time at p_i, is a function of E_i, C_i, l_i, and ΔL_i. According to the above definitions, for a given load of size L units, its computation time at processor p_i is L E_i and its communication time over link l_i is L C_i. Note that the central scheduler works like a router. On a network path, in general, the time taken by a load to reach its final destination depends on the slowest link on the path, owing to the available bandwidths of the various links comprising the path [39]. Thus, if l_i and l_j are the links connecting the source and destination nodes and if C_i ≤ C_j, then we assume that the communication time taken to reach the destination via links l_i and l_j is simply L C_j. It may be noted that this assumption does not affect the way in which the strategy is designed; in fact, we will show that it eases analytical tractability.

Without loss of generality, we index the p_i in the order L_i E_i ≤ L_{i+1} E_{i+1}, i = 1, ..., m − 1. Thus, without any load partitioning and scheduling, the overall processing time of all the loads is determined by the finish time of the last processor p_m, which is given by T_max(m) = L_m E_m. In the case that L_i = 0 for some i ∈ [1, m], we group these processors into a single equivalent processor; the analysis that follows therefore assumes there is at most one processor which has no load. Since a divisible load is assumed to be computationally intensive, a natural assumption is that the computation time of a given load is much larger than its communication time, that is, E_i > C_j, i, j = 1, 2, ..., m. A more general discussion of computation-intensive applications and their computation-communication ratios can be found in [2, 40].
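To make the notation concrete, the following minimal Python sketch (ours, not the authors' code) indexes a processor set by L_i E_i and evaluates T_max(m) = L_m E_m; the parameter values anticipate Example 1 in Sect. 3.

def index_by_load_time(E, C, L):
    # Re-index processors so that L_i * E_i is non-decreasing (Sect. 2).
    order = sorted(range(len(E)), key=lambda i: L[i] * E[i])
    return ([E[i] for i in order], [C[i] for i in order], [L[i] for i in order])

E = [50, 65, 60, 45, 80]         # E_i: sec/MB to compute a unit load
C = [0.3, 0.2, 0.15, 0.1, 0.55]  # C_i: sec/MB to transmit a unit load
L = [100, 110, 120, 180, 150]    # L_i: MB originating at p_i
E, C, L = index_by_load_time(E, C, L)
T_max = L[-1] * E[-1]            # without sharing, the last processor dominates
print(T_max)                     # 12000 sec for these values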
In addition, in [6], a condition referred to as Rule A for single-level tree networks was used to eliminate all the redundant processor-link pairs and obtain an optimally reduced network that achieves the optimal processing time. In our current formulation, with the assumption that E_i > C_j, i, j = 1, 2, ..., m, Rule A is automatically satisfied. The reader is referred to the implications of Rule A as explained in [6].

In the next section, we shall first identify a condition that designates a processor as over-loaded or under-loaded and form the S_over and S_under sets, respectively. Then, we obtain the exact load fractions ΔL_i, p_i ∈ S_under, to be received by the processors in S_under, and ΔL_j, p_j ∈ S_over, to be extracted from the processors in S_over, so as to minimize the overall processing time. Thus, we obtain the exact load portion assigned to p_i as l_i = L_i + ΔL_i, p_i ∈ S_under, or l_j = L_j − ΔL_j, p_j ∈ S_over, for i, j ∈ {1, ..., m}. From the resulting load distribution η, we obtain the overall processing time of all the loads.

3 Load partitioning strategies

In this section, we consider two cases, namely when all the processors are equipped with front-ends and when they are not. In the case with front-ends, we can improve the processing time performance by efficiently overlapping communication with computation [6]. In the case without front-ends, however, communication and computation cannot be fully overlapped at each processor, and the communication delays incurred while redistributing the loads among the processors need to be minimized.

The strategy involves two phases. In the first phase, the entire set of loads is partitioned. In the second phase, the partitioned loads are transferred from one processor to another following a communication strategy. These two phases are carried out for both the with and without front-end cases. This section focuses on the first phase; the next section investigates the second phase. The implementation of the proposed PPDD algorithm also involves these two phases. In the first phase, the scheduler p_0 collects the load distribution information from all slave processors and applies the

PPDD algorithm to obtain the near-optimal load partitions. In the second phase, the set of over-loaded processors initiates sending data and the set of under-loaded processors initiates receiving data; the scheduler coordinates these slaves' sending and receiving operations by routing the data among them. Note that, although the PPDD algorithm appears iterative in obtaining the near-optimal data partition, the amount of load migrated to or from each processor is adjusted only once.

We assume that each processor initially has its own load to process. A processor can start processing its own load or communicating with other processors from time t = 0 onwards, as per the design of a scheduling strategy. During some time interval, a processor may have no load available to process while also not being engaged in receiving any load from other processors. In this situation, the processor simply remains idle without any activity. However, this processor may be assigned a load portion at some later time by the scheduler; hence, until that time, the processor remains idle. We refer to this idle time interval as a starvation gap in the rest of the paper. Efficient load balancing strategies are thus expected to minimize these starvation gaps and maximize system utilization.

3.1 With front-ends

For the case with front-ends, consider an ideal situation in which there is no starvation gap. Also, we assume that the entire communication can be overlapped by computation. In other words, a processor will not starve for data while receiving it from other processors; it will remain engaged in processing its own load. We refer to this situation as the ideal case hereafter. To achieve the optimal processing time for the entire set of loads, we should balance the loads among all the processors such that all the participating processors finish processing at the same time instant. We use this criterion as the optimality condition to determine the optimal solution, as in [6]. Intuitively, if some processors complete processing earlier and other processors complete processing later, we can reschedule some workload from the late processors to the early processors to reduce the overall processing time (which is determined by the processor that finishes last). Thus, for the ideal situation mentioned above, the optimal processing time is given by

T_{ideal}(m) = \frac{\sum_{i=1}^{m} L_i}{\sum_{i=1}^{m} 1/E_i}   (1)

In the above equation, the numerator is the sum of all the loads and the denominator is the total processing power available in the system. From (1), we can obtain the load portions to be transferred from/to the nodes as

\Delta L_i = \left| L_i - \frac{T_{ideal}(m)}{E_i} \right| = |L_i - l_i|, \quad i = 1, 2, \dots, m   (2)

where l_i = T_{ideal}(m)/E_i, i = 1, 2, ..., m, is the load processed at processor p_i after balancing. Note that when L_i > l_i, processor p_i belongs to the set of over-loaded nodes S_over (senders), and hence some of the load at p_i should be transferred to other nodes. On the other hand, when L_i ≤ l_i, p_i belongs to the set of under-loaded nodes S_under (receivers), to which load from other nodes will be transferred. As mentioned earlier, since we index the processors in the order of minimum L_i E_i first, we can uniquely obtain an integer K such that p_i ∈ S_under, i = 1, 2, ..., K, and p_i ∈ S_over, i = K + 1, K + 2, ..., m. We refer to K as a delimiter that separates the receiver and sender sets. Thus, for all p_i ∈ S_under we have L_i ≤ l_i, and for all p_j ∈ S_over we have L_j > l_j.
The load distribution algorithm is presented in Table 1. We initially use the optimal solution obtained for the ideal case and determine a delimiter K to identify the potential senders and receivers. Then, using (2), we derive the load fractions to be exchanged, ΔL_i, i = 1, ..., m, thus obtaining a load distribution η. Because of the assumption that the entire communication can be overlapped by computation, we immediately obtain the finish time of each processor as T_i(m) = l_i E_i. Since the algorithm first partitions the processors into two sets and then distributes the extra loads from the sender set to the receiver set, we refer to it as the Processor-set Partitioning and Data Distribution Algorithm (PPDD).

Table 1 PPDD algorithm for the case with front-ends

Initial stage: From (1) and (2), we obtain the initial delimiter K which separates the sender and receiver sets.
Load distribution: The load assigned to p_i is l_i = L_i + ΔL_i, i = 1, 2, ..., K, and l_i = L_i − ΔL_i, i = K + 1, K + 2, ..., m.
Overall processing time: The finish time of processor p_i is given by T_i(m) = l_i E_i. Thus, we obtain the overall processing time T(m) = max{T_i(m)}, i = 1, 2, ..., m.
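Under the with-front-ends assumptions, (1), (2), and Table 1 amount to only a few arithmetic steps. The sketch below is our illustrative Python rendering, not the authors' implementation; signed differences L_i − l_i are used internally to mark transfer direction, while the returned ΔL_i are magnitudes as in the text.

def ppdd_with_frontends(E, L):
    m = len(E)
    T_ideal = sum(L) / sum(1.0 / e for e in E)   # equation (1)
    l = [T_ideal / e for e in E]                 # balanced assignments l_i
    diff = [L[i] - l[i] for i in range(m)]       # (2); diff <= 0 marks a receiver
    K = sum(1 for d in diff if d <= 0)           # delimiter: p_1 .. p_K receive
    dL = [abs(d) for d in diff]
    # Table 1: with front-ends, every processor finishes at l_i * E_i = T_ideal.
    return l, dL, K, T_ideal

For the parameters of Example 1 below, this returns K = 3 and the ΔL_i values quoted there.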

Note that in the above strategy for load distribution we have not explicitly discussed how the loads are communicated to the respective processors; rather, we have discussed how much load a processor is assigned from the entire set of loads. We shall discuss the communication strategy in the next section. However, in practical situations there may be starvation gaps, and not all communication can be overlapped with computation. We will see how this issue is addressed in Sect. 4, where we discuss the load communication strategies.

3.2 Without front-ends

Table 2 describes the proposed algorithm for finding a load distribution for the case without front-ends in detail. This algorithm operates in three steps. In the first step, an initial solution is obtained by using (1) and (2), and the corresponding sender and receiver sets are formed as described in the previous section. Note that (1) and (2) are for the case with front-ends. In the second step, the feasibility of the resulting load distribution is validated. When all the resulting ΔL_j, j = 1, ..., m, are positive, the PPDD algorithm has obtained a feasible delimiter K and stops the iteration; it then obtains the final load distribution η using (9). In the last step, following the load distribution obtained above, we calculate the overall processing time of all the loads. Since the basic style of working is identical to the case with front-ends, we continue to refer to this algorithm simply as the Processor-set Partitioning and Data Distribution Algorithm (PPDD).

Table 2 PPDD algorithm for the case without front-ends

Initial phase: From (1) and (2), we obtain the initial delimiter K which identifies the potential sender and receiver sets.

Iteration phase: We assume that a processor p_i is a sender if i > K, or a receiver if i ≤ K. Assuming all processors finish processing at the same time, denoted T_x(m), x = 1, ..., m, to achieve the optimal processing time we have

T_x(m) = L_i E_i + \Delta L_i (C_i + E_i), \quad i = 1, 2, \dots, K,   (3)
T_x(m) = L_j E_j + \Delta L_j (C_j - E_j), \quad j = K + 1, \dots, m   (4)

Thus, expressing all the ΔL_i in terms of ΔL_1, for the receiver set we obtain

\Delta L_i = f_i + g_i \Delta L_1, \quad i = 1, 2, \dots, K   (5)

where f_i = \frac{L_1 E_1 - L_i E_i}{C_i + E_i} and g_i = \frac{C_1 + E_1}{C_i + E_i}, i = 1, 2, ..., K. For the sender set, we have

\Delta L_i = f_i + g_i \Delta L_1, \quad i = K + 1, K + 2, \dots, m   (6)

where f_i = \frac{L_1 E_1 - L_i E_i}{C_i - E_i} and g_i = \frac{C_1 + E_1}{C_i - E_i}, i = K + 1, K + 2, ..., m. Since the sum of the loads transferred from the sender set and the sum of the loads received by the receiver set must be identical, we have

\sum_{i=1}^{K} \Delta L_i = \sum_{i=K+1}^{m} \Delta L_i   (7)

Thus, from (5), (6) and (7), the closed-form solution for ΔL_1 is given by

\Delta L_1 = \frac{\sum_{i=K+1}^{m} f_i - \sum_{i=1}^{K} f_i}{\sum_{i=1}^{K} g_i - \sum_{i=K+1}^{m} g_i}   (8)

Equations (5), (6) and (8) give the solution for ΔL_1, ΔL_2, ..., ΔL_m, which should all be non-negative. If any resulting ΔL_i is negative, we update the receiver and sender sets by moving p_{K+1} from S_over to S_under and increasing K by 1. We repeat the calculations given by (3) to (8) until all of ΔL_1, ΔL_2, ..., ΔL_m are non-negative.

Overall processing time: The loads assigned to the individual processors are given by

l_i = L_i + \Delta L_i = L_i + f_i + g_i \Delta L_1, \quad i = 1, \dots, K;
l_i = L_i - \Delta L_i = L_i - (f_i + g_i \Delta L_1), \quad i = K + 1, \dots, m   (9)

The finish time of processor p_i is given by

T_i(m) = L_i E_i + (f_i + g_i \Delta L_1)(C_i + E_i), \quad i = 1, 2, \dots, K;
T_i(m) = L_i E_i + (f_i + g_i \Delta L_1)(C_i - E_i), \quad i = K + 1, \dots, m   (10)

Thus, the overall processing time is T(m) = max{T_i(m)}, i = 1, 2, ..., m.
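The iteration phase of Table 2 translates directly into code. The sketch below is ours (0-based indices, with the initial K taken from the ideal-case partition); it solves (8) in closed form and enlarges the receiver set until every ΔL_i is non-negative.

def ppdd_without_frontends(E, C, L, K):
    m = len(E)
    while K < m:
        # f_i, g_i from (5) for receivers (index < K) and (6) for senders.
        f = [(L[0] * E[0] - L[i] * E[i]) / ((C[i] + E[i]) if i < K else (C[i] - E[i]))
             for i in range(m)]
        g = [(C[0] + E[0]) / ((C[i] + E[i]) if i < K else (C[i] - E[i]))
             for i in range(m)]
        dL1 = (sum(f[K:]) - sum(f[:K])) / (sum(g[:K]) - sum(g[K:]))  # (8)
        dL = [f[i] + g[i] * dL1 for i in range(m)]                   # (5), (6)
        if all(d >= 0 for d in dL):
            l = [L[i] + dL[i] if i < K else L[i] - dL[i] for i in range(m)]  # (9)
            T = [L[i] * E[i] + dL[i] * ((C[i] + E[i]) if i < K else (C[i] - E[i]))
                 for i in range(m)]                                   # (10)
            return l, dL, K, max(T)
        K += 1  # move p_{K+1} from the sender set to the receiver set
    raise ValueError("no feasible delimiter found")

Note that each pass recomputes f_i and g_i, because moving p_{K+1} into the receiver set changes the denominators in (5) and (6).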

It may be noted that, since the load distribution of the ideal case serves as a convenient starting point for the case without front-ends, the above load distribution strategy avoids iterating from the value K = 1 onwards. Further, the optimal load distribution for the ideal case will definitely identify a larger sender set than the optimal load scheduling for the case without front-ends, because all the communication can be overlapped with computation in the ideal case. Also, for the ideal case, we expect a smaller overall processing time than that of the case without front-ends. In addition, we initially designate as senders all the processors whose original processing times (L_i E_i) are greater than the expected overall processing time. This is our basic idea for judiciously using the initial results as the starting point of the algorithm. Thus, we increase K in a step-by-step fashion, shrinking the sender set, in order to find the ultimate feasible receiver and sender sets. The initial choice of K determines the number of iterations needed by the PPDD algorithm. From equations (3) through (8), we observe that the algorithm always guarantees that the resulting load distribution makes all the processors finish processing at the same time. We present several significant properties exhibited by the proposed strategy below.

Lemma 1 In the case without front-ends, whenever the loads are not balanced, i.e., L_i E_i ≠ L_j E_j for some i ≠ j, there is at least one sender and one receiver, and ΔL_i > 0 for all receivers p_i ∈ S_under.

Lemma 2 The load distribution strategy takes a finite number of steps to converge. It needs only n < (m − K) iterations to obtain a near-optimal solution.

The proofs of the above lemmas can be found in the Appendix. Note that the above two lemmas also hold for the case with front-ends. All the properties presented in the above lemmas lead to the following conjecture.

Conjecture 1 The load distribution strategies presented in Tables 1 and 2 yield optimal solutions for the cases with and without front-ends, respectively.

A rigorous proof can be attempted following the treatment presented in [6]; the basic idea is to argue by contradiction if the solution given by the PPDD algorithm is not followed. Due to the uncertainty of the processing speed distribution in heterogeneous systems, we have not derived a satisfactory rigorous proof of this conjecture; our ongoing work is to derive a proof of statistical optimality for the PPDD algorithm. However, we observe from the workings of the algorithms that any re-distribution away from the load scheduling proposed above causes an imbalance of the loads among the processors and results in under-utilizing certain processors, since some processors may be busy processing while others have finished their tasks, thus increasing the overall processing time. Based on the above proofs and observations, we argue that the PPDD algorithm yields near-optimal solutions.

To see the working steps of this algorithm, we present a numerical example with the following speed parameters. This is for the case without front-ends. Note that, since Examples 1 to 4 are based on numerical analysis, the results are stable and deterministic. These examples are used to verify our theoretical analysis and to demonstrate certain features of our proposed algorithms more vividly. The ranges of the parameters (normalized processor and link speeds) and the computation-to-communication ratios used in these examples follow the observations and guidelines in [2, 40].

Example 1 In this example, we consider a single-level tree network with m = 5 processors and a root node (central scheduler). The system parameters are set as follows: processor speeds E_1 = 50 sec/MB, E_2 = 65 sec/MB, E_3 = 60 sec/MB, E_4 = 45 sec/MB, E_5 = 80 sec/MB, and link speeds C_1 = 0.3 sec/MB, C_2 = 0.2 sec/MB, C_3 = 0.15 sec/MB, C_4 = 0.1 sec/MB, C_5 = 0.55 sec/MB. These parameters are typical for image processing applications [13, 41]. The sizes of the respective loads injected at the processors are L_1 = 100 MB, L_2 = 110 MB, L_3 = 120 MB, L_4 = 180 MB, L_5 = 150 MB. We index the processors in the order of smallest L_i E_i first, as mentioned before. Note that the original processing times at the processors (calculated using L_i E_i, i = 1, ..., 5) are 5000, 7150, 7200, 8100, and 12000 sec, respectively, in increasing order. Thus, if each processor processes its own load without sharing with other processors, the overall processing time of the entire set of loads is 12000 sec and the average processing time is 7890 sec.

Using (2), we first obtain the ideal scheduling solution as follows. The sender and receiver set delimiter is K = 3; hence, the sender set is S_over = {p_4, p_5} and the receiver set is S_under = {p_1, p_2, p_3}. The amounts of load migration are ΔL_1 = 52.12, ΔL_2 = 7.02, ΔL_3 = 6.77, ΔL_4 = 10.97, ΔL_5 = 54.92, respectively. In the ideal case, the resulting schedule makes all the processors finish processing at the same time instant, and the overall processing time is 7606.0 sec.

Following the PPDD algorithm for the case without front-ends presented above, using K = 3 as the initial starting point, after one iteration we obtain the scheduling solution as follows. The sender and receiver delimiter is still K = 3; thus the sender set is S_over = {p_4, p_5} and the receiver

set is S_under = {p_1, p_2, p_3}. The amounts of load exchanged are ΔL_1 = 51.98, ΔL_2 = 7.13, ΔL_3 = 6.89, ΔL_4 = 10.81, ΔL_5 = 55.19, respectively. All processors finish processing at the same time, and the overall finish time of the entire set of loads is 7614.7 sec.

From the above example, we observe that the resulting overall processing time for the case without front-ends is quite close to that of the ideal case and is much less than the original processing time without load sharing (a reduction of about 36.5%). In addition, the near-optimal finish time obtained, 7614.7 sec, is even better than the average of the original individual processing times of the respective loads, 7890 sec. These results clearly elicit the fact that any naive strategy which aims to achieve an average processing time, or which assigns equal-size portions among the processors (the average of all the loads), will not produce a good solution in heterogeneous computing networks.

3.3 Homogeneous systems without front-ends

To gain more insight into the properties of the proposed algorithms, we further analyze homogeneous systems. Due to the irregular (sometimes random) parameters of heterogeneous systems, it is difficult to observe the natural trends of the performance and load distribution under the PPDD algorithm. Homogeneous settings offer an opportunity to examine some special behaviors of such systems, and findings for homogeneous systems can be used as a reference or approximation for similar heterogeneous systems. For a homogeneous system, we have C_i = C, i = 1, 2, ..., m, and E_i = E, i = 1, 2, ..., m. In this case, we observe some interesting special properties exhibited by the load partitioning strategy.

Lemma 3 In a homogeneous system, in the near-optimal load distribution obtained using the proposed strategy, we always have ΔL_i ≥ ΔL_{i+1}, i = 1, 2, ..., K − 1, for the receiver set, and ΔL_i ≤ ΔL_{i+1}, i = K + 1, ..., m − 1, for the sender set.

The proof of this lemma is presented in the Appendix. From Lemma 3, we observe that the proposed strategy wisely balances the loads among all the processors: it pulls more load from the heavily loaded processors and pushes more load to the more lightly loaded processors. A similar behavior of the proposed strategy can also be observed in heterogeneous computing systems.

Example 2 In this numerical example, we consider a homogeneous single-level tree network with m = 10 processing nodes. We set the processor and link speed parameters as E = 10 sec/MB and C = 1 sec/MB, respectively. The sizes of the loads originating at the processors are {L} = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100} MB, respectively. Following the PPDD algorithm, we obtain the receiver set {p_1, p_2, p_3, p_4, p_5}, the final load distribution {l} = {53.18, 54.09, 55.0, 55.91, 56.82, 57.22, 56.11, 55.0, 53.89, 52.78} MB, and {ΔL} = {43.18, 34.09, 25.0, 15.91, 6.82, 2.78, 13.89, 25.0, 36.11, 47.22} MB, respectively. In addition, all processors finish processing at the same time, given by T(m) = 575.0 sec. Without load re-distribution, the overall processing time is 1000 sec.

Fig. 2 Load distributions of Example 2

The load distribution is illustrated in Fig. 2. From this figure, we observe that in the final load distribution the loads are almost equally distributed for this homogeneous system. In this example, due to the non-negligible communication delays incurred in the load communication phase, the individual loads assigned to the processors are not identical in size. Since the minimum communication delays occur at p_5 and p_6 (ΔL_5 = 6.82, ΔL_6 = 2.78), they process larger amounts of load than the other processors. Further, we observe that the distribution of the exchanged loads ΔL_i completely adheres to the statement of Lemma 3.
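Example 2 can be reproduced with the sketch given after Table 2; the initial delimiter K = 5 comes from the ideal case, in which l_i = 55 MB for every processor. (An illustrative check, not the authors' code.)

E = [10.0] * 10
C = [1.0] * 10
L = [10.0 * (i + 1) for i in range(10)]      # 10, 20, ..., 100 MB
l, dL, K, T = ppdd_without_frontends(E, C, L, K=5)
print(round(T, 1))                           # 575.0 sec, all finish together
print([round(d, 2) for d in dL])             # decreasing over receivers,
                                             # increasing over senders (Lemma 3)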

One parameter that is often important in the study of load distribution problems in network-based environments is the ratio of the communication delay to the computation delay. To see the effect of various communication-to-computation ratios, we consider the system used in Example 2 for different communication speeds, C = 1, 2, ..., 6, keeping all other parameters the same. We denote the communication-to-computation speed ratio as δ = C/E, so that δ = 0.1, 0.2, ..., 0.6. Since we consider computation-intensive applications, the communication-computation ratios are typically much less than 0.5 [40]. The effect of δ can be understood by observing the variation of the exchanged loads among the processors.

Fig. 3 Exchanged load distributions for various communication to computation ratios (δ = 0.1 to 0.6)

The resulting exchanged-load distributions are illustrated in Fig. 3 for various values of δ. From this figure, we observe that the basic tendency of the exchanged-load distributions (ΔL_i) is a V-shaped curve; that is, the amount of load exchanged first decreases and then increases, for any value of δ. However, with increasing δ, the receiver set grows from 5 processors at δ = 0.1 to 7 processors at δ = 0.6, as shown in Fig. 3. According to Lemma 3, we know that ΔL_i > ΔL_{i+1} in the receiver set and ΔL_i < ΔL_{i+1} in the sender set. As a result, more load is transferred from the last processor p_m when δ is larger.

Fig. 4 Final load distributions for various communication to computation ratios (δ = 0.1 to 0.6)

From Fig. 4, we observe that the final load distribution for δ = 0.1 is well balanced among all the processors. As the communication-to-computation ratio increases, in this case (without front-ends), the last processor p_10 is engaged in communication for a large amount of time rather than in processing, which results in a smaller workload assignment.

Fig. 5 Overall processing times for various communication to computation ratios (δ = 0.02 to 0.6)

In Fig. 5, we see that the overall processing time increases monotonically as δ increases, as expected.

4 Load communication strategies

We shall now propose a load communication strategy to efficiently implement the load balancing among all the processors using the near-optimal load distribution obtained by the PPDD strategy in the previous section. It may be noted that when one attempts to distribute the optimal load fractions between the processors for balancing, some processors may go idle, and the resulting overall processing time of the entire set of loads may be greater than the near-optimal solution proposed by PPDD in the previous section. Thus, we need a communication strategy which carefully accounts for the communication delays while implementing the PPDD strategy (balancing the load fractions). The scheduler p_0 therefore first obtains the near-optimal fractions using PPDD, and these serve as the input to the communication strategy. In the following, we first describe the communication strategy for the case without front-ends, as this needs a systematic treatment; for the with front-ends case, we can then design the strategy using the procedure carried out for the without front-ends case.

Without loss of generality, let us assume that p_i ∈ S_under, i = 1, 2, ..., K, and p_j ∈ S_over, j = K + 1, K + 2, ..., m, where K is the delimiter of the sender and receiver sets. The load redistribution process is described as follows. Initially, p_i holds L_i units of load. We shall redistribute the extra loads ΔL_j from S_over to S_under. However, a processor p_i in S_under will accept only ΔL_i. Thus, the senders and receivers are not sending and receiving identical amounts of load, and we need to devise a strategy which carries out the redistribution process so as to minimize the overall processing time. Table 3 presents the details of the redistribution procedure and explains the various stages involved in the communication process. Note that the senders transfer load to the central scheduler p_0, and the scheduler routes the load to the respective receivers. The transferred load sizes are determined by (5) through (8). Further, it may be observed that at any time instant there is only one active sender and one active receiver. In the load communication strategy shown in Table 3, we denote the communication time slots of senders with a superscript s and the communication time slots of receivers with a superscript r.

Table 3 Load communication strategy for the case without front-ends

Initial stage: Initially, processor p_i has load L_i, i = 1, 2, ..., m. The sender set and the receiver set initiate communication at the same time. The first sender is p_m and the first receiver is p_1.

Load communication stage:
Sender part: Processors p_j ∈ S_over, j = K + 1, ..., m, send their extra loads ΔL_j, obtained from (5) through (8), to the scheduler p_0 in the reverse order of the processor index (from p_m down to p_{K+1}). At the beginning, processor p_m starts communication at t_0^s = 0 and stops at t_1^s. The communication time slot of p_j is [t_{m−j}^s, t_{m−j+1}^s], where t_{m−j}^s is the time instant at which p_j starts transferring its extra load and t_{m−j+1}^s is the time instant at which p_j stops the communication. It is given by

t_{m-j+1}^s = t_{m-j}^s + \Delta L_j C_j, \quad j = m, m-1, \dots, K+1   (11)

Thus, the finish time of processor p_j is given by

T_j(m) = \max\{(L_j - \Delta L_j) E_j + \Delta L_j C_j,\; t_{m-j+1}^s\}   (12)

Receiver part: Processors p_i ∈ S_under, i = 1, 2, ..., K, receive the loads ΔL_i, obtained from (5) through (8), from the scheduler p_0 in the order of the processor index, from p_1, p_2, ..., to p_K. At time zero, t_0^r = 0, p_1 starts to receive the load transferred from the sender set through the central scheduler and ends communication at time instant t_1^r. For p_i, i = 1, 2, ..., K, the communication time slot is [t_{i−1}^r, t_i^r], where t_i^r is given by

t_i^r = t_{i-1}^r + \Delta L_i C_i  for C_i ≥ C_j (sender is faster);  t_i^r = t_{i-1}^r + \Delta L_i C_j  for C_i < C_j (receiver is faster)   (13)

Hence, the finish time of p_i is given by

T_i(m) = \max\{(L_i + \Delta L_i) E_i + \Delta L_i C_i,\; t_i^r + \Delta L_i E_i\}   (14)

Note that in the above equation the term ΔL_i C_i may be substituted by ΔL_i C_j if the corresponding sender is slower than the receiver during this communication session.

Overall processing time: T(m) = max{T_i(m), T_j(m)}, i = 1, 2, ..., K, j = K + 1, K + 2, ..., m.

As mentioned in the algorithm, because of the heterogeneous communication speeds, we shall calculate the communication time taken by the receivers to receive the loads from the respective senders. Since the scheduler works as a router, as per our earlier assumption, the communication time for a receiver is determined by the slower of the links connecting the active sender and the active receiver during the communication session. Note that there is only one active sender and one active receiver at any instant in time.
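The receiver-side clock of (13) can be simulated with a short routine. The sketch below is our illustration of Table 3, under the same slower-link assumption: it drains the senders in reverse index order and returns each receiver's communication end time t_i^r.

def communication_schedule(C, dL, K):
    # Senders p_m .. p_{K+1} push dL_j via the scheduler; receivers p_1 .. p_K
    # pull in index order. Each transfer is paced by the slower of the two links.
    m = len(C)
    senders = [[j, dL[j]] for j in range(m - 1, K - 1, -1)]
    t, t_recv = 0.0, []
    for i in range(K):
        need = dL[i]
        while need > 1e-12:
            j, left = senders[0]
            x = min(need, left)
            t += x * max(C[i], C[j])        # equation (13)
            need -= x
            senders[0][1] -= x
            if senders[0][1] <= 1e-12:
                senders.pop(0)              # this sender is drained
        t_recv.append(t)
    return t_recv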

However, in the proposed strategy, at different time instants a sender may cater to more than one receiver, and a receiver may receive loads from more than one sender. Further, the communication time of the sender part is governed solely by the sender's own link speed, while the communication time of the receiver set is determined not only by the receiver's own link speed but also by the sender's link speed during that communication session.

Now, for the case with front-ends, we follow the same load communication procedure explained above. However, with front-ends, we modify (12) and (14) as

T_j(m) = \max\{(L_j - \Delta L_j) E_j,\; t_{m-j+1}^s\},   (15)
T_i(m) = \max\{(L_i + \Delta L_i) E_i,\; t_i^r + \Delta L_i E_i\}   (16)

Let us now demonstrate the strategies via a detailed numerical analysis.

Example 3 We continue with the system used in Example 1 in Sect. 3 for this case study. Using the proposed load communication strategy, we obtain the following results. The finish times of the processors are T_1(m) = 7627.7, T_2(m) = 7615.8, T_3(m) = 7614.7, T_4(m) = 7614.7, T_5(m) = 7614.7 sec. Thus, the overall finish time of the entire set of loads is T(m) = 7627.7 sec, which is approximately 0.17% more than the overall processing time obtained without considering the load communication stage. The load communication process on each processor is illustrated in Fig. 6 (not in exact proportion). The results of Example 1 imply that the receiver set is S_under = {p_1, p_2, p_3} and the sender set is S_over = {p_4, p_5}.

From the results of Example 3, we observe that all the processors in the sender set finish processing at the same time, while the processors in the receiver set may not finish processing at the same time. This is due to the fact that the communication times of the senders are determined only by their own channel speeds, whereas the communication time of the receiver part is determined by the slower link between the active sender and receiver, thus causing longer communication delays.

Fig. 6 Timing diagram for Example 3. We have 5 processors in total; 2 processors (p_4 and p_5) play the sender role and 3 processors (p_1, p_2 and p_3) are in the receiver set. We observe that the senders all stop at the same time while the receivers do not stop at the same time

Figure 6 illustrates the load communication process for Example 3. Above the time axes, we show the communication process using shaded blocks, and below the time axes, we show the computation process using blank blocks. In the group of senders, p_5 starts sending its load at time zero and stops its load communication at time instant 30.36 sec. During this time, p_4 is processing its own load independently. Immediately following p_5, p_4 starts communication at 30.36 sec, which lasts only 1.08 sec. In the group of receivers, p_1 receives its assigned load from time zero to the time instant 28.59 sec. During this time period, p_2 and p_3 are processing their own loads. Following p_1, p_2 continues the load communication, and then p_3 starts communication following p_2. When a processor is not engaged in load communication, it processes its available load independently, as clearly shown in Fig. 6. From the results obtained, we also observe that the difference between the actual overall processing time and the near-optimal finish time obtained using PPDD (in the previous section) is approximately 0.17%. Thus, the proposed strategy is shown to be efficient and close to the optimal solution.
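Combining the two sketches above reproduces Example 3 (our illustration; without front-ends, a receiver computes whenever it is not receiving, and the receivers are served back-to-back):

E = [50.0, 65.0, 60.0, 45.0, 80.0]
C = [0.3, 0.2, 0.15, 0.1, 0.55]
L = [100.0, 110.0, 120.0, 180.0, 150.0]
l, dL, K, T_ppdd = ppdd_without_frontends(E, C, L, K=3)
t_r = communication_schedule(C, dL, K)
# Finish time of receiver p_i: its total compute time plus its own receive window.
T_recv = [l[i] * E[i] + (t_r[i] - (t_r[i - 1] if i else 0.0)) for i in range(K)]
# T_recv ~ [7627.7, 7615.8, 7614.7] sec; senders finish at ~T_ppdd ~ 7614.7 sec.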
5 Discussions of the results

The contributions in this paper are novel to the DLT literature. The paper addresses a realistic situation in a distributed network of processors wherein computational loads can originate at any processor on the network. Thus, when there is more than one load to be processed in the system, unless a clever strategy for load distribution is employed, the processors may not be efficiently utilized. The existing DLT literature [32] addresses the processing of multiple loads on distributed networks, but it assumes that all the loads originate at the central scheduler (the bus controller unit in the case of bus networks). Our formulation considers loads originating at different processors on the network. We proposed load distribution (PPDD) and communication strategies for the cases when the processors are and are not equipped with front-ends. For the case with front-ends, we simply use (2) to obtain the loads exchanged among the processors and hence the final load distribution l_i, i = 1, 2, ..., m. For the case without front-ends, we follow the steps proposed in Table 2. The PPDD algorithm takes advantage of the optimality principle to minimize the overall processing time. As proven in

Lemma 1 and Lemma 2, the PPDD algorithm is guaranteed to determine the near-optimal solution in a finite number of steps. Since the load partitioning phase does not account for the communication delays that are encountered during the actual load communication, we obtain the near-optimal solution immediately from the procedure described in Sect. 3. When we consider the actual load transfer to the processors, the PPDD algorithm is guaranteed to produce a near-optimal solution for homogeneous systems. However, in heterogeneous systems, (3) and (4) use only the communication speed of the corresponding receiving link, which causes imprecise results when the actual load transfer takes place, as clearly demonstrated in Examples 2 and 3. This minor discrepancy in the results is due to the fact that the communication speeds of the active sending and receiving links differ, whereas the actual load transfer time is determined solely by the slower link and not by the receiving link. One may relax this assumption and consider the combined effect of both link delays in the PPDD algorithm; however, the resulting solution may not be drastically different from what is proposed by PPDD under the current model. A significant advantage of PPDD in its current form is the simplicity of designing and implementing a scheduler at the root processor p_0.

However, one may attempt to use other strategies to schedule and transfer loads among processors to minimize the processing time. A natural choice is to modify the load scheduling strategies proposed in the literature [6, 32], in which scheduling strategies for multiple loads arriving at a bus controller unit (BCU) were studied. In this paper, by contrast, we consider the case where multiple loads originate at different sites. In any case, we can apply the previous scheduling strategies from the literature to our problem context. At first, we consider a single-level tree network with only one load originating at a processor. Using the following equations, we obtain a near-optimal load distribution for a single load. Then, we may repeat this procedure for all the loads residing at the other sites. For comparison purposes, we consider the case without front-ends in this section; for the case with front-ends, one may follow a similar procedure.

Fig. 7 A single-level tree network with a single load

Without loss of generality, we assume that processor p_1 has a load L, as shown in Fig. 7. The load distribution process is as follows. Processor p_1 partitions its load L into m portions, α_1 L, α_2 L, ..., α_m L. Then, p_1 distributes load fraction α_2 L to p_2 first, then α_3 L to p_3, ..., until α_m L to p_m, respectively, via p_0. After the load distribution, p_1 starts processing its own load α_1 L. The timing diagram for this load distribution process can be found in [6]. Thus, to balance the load among all the processors in such a way that they finish processing at the same time, we have

\alpha_1 L E_1 = \alpha_m L E_m,   (17)
\alpha_i L E_i = \alpha_{i+1} L (\max\{C_1, C_{i+1}\} + E_{i+1}), \quad i = 2, 3, \dots, m-1,   (18)
\sum_{i=1}^{m} \alpha_i = 1   (19)

where α_i, i = 1, 2, ..., m, is the fraction of the load L assigned to processor p_i. From Fig. 7, we observe that link l_1 is the common link for all communications between p_1 and p_i, i = 2, 3, ..., m. Because the communication time is determined by the slower link, we use max{C_1, C_{i+1}} in the above equations to obtain the (slower) link speed between links l_1 and l_{i+1}. From the above equations, we can obtain the individual α_i; hence, the near-optimal load fraction to be assigned to p_i is α_i L, i = 1, 2, ..., m.
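Equations (17)-(19) can be solved by propagating the recursion (18) from an unnormalized α_2, closing the chain with (17), and normalizing with (19). A minimal sketch, assuming the load originates at the first listed processor:

def single_load_fractions(E, C):
    # Fractions alpha_i for one load at p_1, distributed via p_0 (Fig. 7).
    m = len(E)
    a = [0.0] * m
    a[1] = 1.0                                # unnormalized alpha_2
    for i in range(1, m - 1):                 # (18): alpha_i E_i = alpha_{i+1}(max{C_1, C_{i+1}} + E_{i+1})
        a[i + 1] = a[i] * E[i] / (max(C[0], C[i + 1]) + E[i + 1])
    a[0] = a[m - 1] * E[m - 1] / E[0]         # (17): alpha_1 E_1 = alpha_m E_m
    total = sum(a)
    return [x / total for x in a]             # (19): fractions sum to 1

For the homogeneous system of Example 4 below (E = 10 sec/MB, C = 1 sec/MB, m = 5), this yields α ≈ {0.177, 0.236, 0.215, 0.195, 0.177}, matching the fractions listed in Table 5.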

Equations (17) through (19) provide a near-optimal solution for sharing one load among all the processors. We now apply this to a new strategy for sharing multiple loads, referred to as the Round-robin Scheduling Algorithm (RSA), described in Table 4. The RSA algorithm is an extension of the load scheduling strategy for a single load. Note that while a load is being scheduled among all the processors, the processors that are not engaged in communication can process their loads independently. In every iteration, in a step-by-step fashion, each processor distributes its load among all the processors. Further, each processor attempts to balance all the loads among all the processors such that they finish processing at the same time. To compare the time performance of the RSA and PPDD strategies, we present an example based on numerical analysis.

Table 4 Round-robin scheduling algorithm for the case without front-ends

Initial stage: Initially, processor p_i has load L_i, i = 1, 2, ..., m.
Iteration stage: From p_m to p_1, all processors share their loads step by step. At the first iteration, we schedule load L_m among all processors. After the communication of L_m, the remaining load on processor p_{m−1} is shared among all processors. At iteration i, p_{m−i+1} schedules its available load among all processors and lets them finish processing this load at the same time, using (17) to (19). Note that at each iteration we distribute the load in the order of link speed, faster links first. Also, note that there is no violation of Rule A [6].
Final solution: At the end of iteration m, we stop the load scheduling and let all processors process their own loads to the end. Thus, the overall processing time of all the loads is determined by the finish time of the processor that takes the maximum time to complete its processing.

Example 4 We consider a homogeneous single-level tree network with m = 5 processors. The speed parameters are set as follows: E_i = 10 sec/MB and C_i = 1 sec/MB, i = 1, 2, ..., 5. The original load status is L_1 = 10 MB, L_2 = 20 MB, L_3 = 30 MB, L_4 = 40 MB, L_5 = 50 MB. The working steps of RSA are shown in Table 5. In Table 5, for example, row 1 specifies the distribution process by processor p_5. Thus, starting from 0 sec, p_5, which has 50 MB (see column 2 of row 1), starts distributing its load to the rest of the processors as per the RSA algorithm, and the respective load fractions are shown in the last column. Similarly, the distribution process continues for the other processors. Using the RSA algorithm, we obtain the finish times T_1(m) = 256.3 sec, T_2(m) = 329.4 sec, T_3(m) = 385.4 sec, T_4(m) = 429.3 sec, T_5(m) = 463.9 sec. Thus, the overall processing time is T(m) = 463.9 sec. Because the system in this example is homogeneous, the load fractions α_i are the same at every iteration. Using the PPDD algorithm, in contrast, we obtain an overall processing time of T*(m) = 312.2 sec, with all processors finishing at the same time. As a reference, the optimal processing time in the ideal situation is 300 sec.

Table 5 Results of Example 4 using RSA

Iteration (i)       Sender and load    Sender's α       Receivers' α
1 at 0 sec          p_5, 50.000 MB     α_5 = 0.177      {α_1, α_2, α_3, α_4} = {0.236, 0.215, 0.195, 0.177}
2 at 41.136 sec     p_4, 45.636 MB     α_4 = 0.177      {α_1, α_2, α_3, α_5} = {0.236, 0.215, 0.195, 0.177}
3 at 78.682 sec     p_3, 42.646 MB     α_3 = 0.177      {α_1, α_2, α_4, α_5} = {0.236, 0.215, 0.195, 0.177}
4 at 113.768 sec    p_2, 41.251 MB     α_2 = 0.177      {α_1, α_3, α_4, α_5} = {0.236, 0.215, 0.195, 0.177}
5 at 147.706 sec    p_1, 41.827 MB     α_1 = 0.177      {α_2, α_3, α_4, α_5} = {0.236, 0.215, 0.195, 0.177}

All communication finishes at time 182.118 sec. At this time, the distribution of the remaining loads is {7.415, 14.728, 20.327, 24.714, 28.180} MB.
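One full RSA pass for Example 4 can be simulated as below, reusing single_load_fractions from the sketch above. The code encodes our reading of Table 4 (a processor computes whenever it is neither sending nor receiving; the distributor cannot compute while sending) and reproduces the loads and times in Table 5.

def rsa_example4():
    E, C, m = 10.0, 1.0, 5
    alpha = single_load_fractions([E] * m, [C] * m)  # [0.177, 0.236, 0.215, 0.195, 0.177]
    load = [10.0, 20.0, 30.0, 40.0, 50.0]
    t = 0.0
    for it in range(m):                              # distributors p_5, p_4, ..., p_1
        src = m - 1 - it
        dt = (1.0 - alpha[0]) * load[src] * C        # total send time this round
        others = [i for i in range(m) if i != src]
        for k, i in enumerate(others):               # receivers in index order
            recv = alpha[k + 1] * load[src]
            done = (dt - recv * C) / E               # amount computed while not receiving
            load[i] = max(0.0, load[i] - done) + recv
        load[src] *= alpha[0]                        # distributor keeps alpha_1 of its load
        t += dt
    finish = [t + x * E for x in load]
    return t, load, finish
# t ~ 182.1 sec; remaining loads ~ {7.4, 14.7, 20.3, 24.7, 28.2} MB;
# finish times ~ {256.3, 329.4, 385.4, 429.3, 463.9} sec, as in Example 4.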

14 44 Cluster Comput (2010) 13: RSA strategy, the last processor to distribute is p 1 and it is the first one to finish processing its load. The reason is that, at the last iteration, p 1 lets all other processors finish processing its remaining load at the same time while other processors have their own load during that time. At the end of the load communication phase, the remaining load at p 1 is MB (smallest), as shown in the last row of Table 5. Thus, all the other processors need more time to finish their loads than p 1 after the end of load communication. Anaturalimprovementistorepeatround-robinscheduling until the finish times of all processors are sufficiently close. But RSA cannot avoid any additional time delays (overhead) incurred due to shuttling of load from and to the same processor, i.e., a load fraction transferred from p i to p j in previous iterations may be transferred back to p i from p j or p k,thuswastingthecommunicationresources.since there is no front-end to overlap communication and computation, such kind of unnecessary load wandering greatly prolongs the overall processing time. On the other hand, RSA always needs m iterations to obtain the final solution while PPDD algorithm needs only (m K) iterations,as proven in Lemma 2. InExample4, RSAneeds5iterations while PPDD algorithm needs only one iteration to obtain a better solution. If we improve RSA through repeating the round-robin scheduling, RSA needs more iterations to obtain a better solution. However, even in this case, improved version of RSA cannot avoid load wandering (as discussed above) from and back to a processor either. 6Conclusions We have addressed the problem of scheduling strategies for divisible loads originating from multiple sites in single-level tree networks. The formulation presented a general scenario with multi-site divisible loads, demanding several processors to share their loads for processing. We have designed aloaddistributionstrategyandcommunicationstrategyto carry out the processing of all the loads submitted at various sites. A two phase approach is taken to attack the problem aloadpartitioningphaseandtheactualcommunicationof load fractions to the respective processors (communication strategy). In the first phase, we derive the near-optimal load distribution; in the second phase, we consider the actual communication delay in transferring the load fractions to the processors, by assuming that the overall delay is contributed by the slowest link between the sending and receiving processors. As a first step, one can relax this assumption and analyze the performance and the proposed scheduling strategies are flexible in adapting to such relaxed assumptions as mentioned in the discussions section. For cases with front-ends and without front-ends, we propose a scheduling strategy, PPDD algorithm, to achieve a near-optimal processing time of all loads. Several significant properties of PPDD algorithm are proven in lemmas and detailed analysis of time performance of PPDD algorithm was conducted. The above analysis is also extended to homogeneous systems wherein we have shown that the time performance of PPDD algorithm with respect to various communication-computation ratios. To implement the load distribution strategy obtained through PPDD algorithm, we proposed a simple load communication strategy. It was demonstrated that the overall processing time obtained using PPDD algorithm is sufficiently close to the result following the actual load communication strategy proposed. 
To further demonstrate the efficiency of the PPDD algorithm, we also compared its time performance with that of another algorithm, the Round-robin Scheduling Algorithm (RSA), which is most commonly used. It is shown that the proposed PPDD algorithm produces better scheduling solutions than RSA. Detailed discussions and comparisons are carried out. The proposed load scheduling strategies can be readily extended to other network topologies in a similar way. Another interesting extension is to further study the case with multiple load arrivals at each processor, which models dynamic scheduling scenarios in grid or cloud computing environments.

Acknowledgements  The authors would like to thank the editors and referees for their valuable suggestions, which have significantly helped improve the quality and presentation of this paper. The research presented in this paper is supported in part by US National Science Foundation grant CNS.

Appendix

Proof of Lemma 1  Since $L_1 E_1 \ge L_i E_i$ for $i \ge 2$, and $C_i < E_j$ for any $i, j \in \{1, \ldots, m\}$, as mentioned before, we have

$$f_i = \frac{L_1 E_1 - L_i E_i}{C_i + E_i} \ge 0, \quad i = 1, 2, \ldots, K, \qquad (12)$$

$$g_i = \frac{C_1 + E_1}{C_i + E_i} > 0, \quad i = 1, 2, \ldots, K, \qquad (13)$$

$$f_i = \frac{L_1 E_1 - L_i E_i}{C_i - E_i} \le 0, \quad i = K+1, K+2, \ldots, m, \qquad (14)$$

$$g_i = \frac{C_1 + E_1}{C_i - E_i} < 0, \quad i = K+1, K+2, \ldots, m. \qquad (15)$$

Since the loads are unbalanced, $f_i > 0$ for some $i$ and $f_i < 0$ for some $i$. From the closed-form expression of $\delta L_1$ given in (8) and using the above inequalities, we immediately see that $\delta L_1 > 0$. Similarly, by expressing all $\delta L_i$, $i = 1, 2, \ldots, m-1$, in terms of $\delta L_m$ and through algebraic manipulations similar to those used to prove $\delta L_1 > 0$, we can

also show that $\delta L_m > 0$. Therefore, there is always at least one sender and one receiver available. Now we shall prove the second part of the lemma. For the receivers $p_i$, $i = 1, \ldots, K$, at each iteration of the above load distribution strategy we naturally have $(L_1 + \delta L_1) E_1 + \delta L_1 C_1 > L_i E_i$, $i = 1, \ldots, K$, where the LHS is the new finish time of processor $p_1$ after load distribution and the RHS is the earlier finish time of processor $p_i$. This is because we attempt to balance the loads among all the processors by extending (stretching) the processing time of the processors in the receiver set and reducing (shrinking) the processing time of the processors in the sender set in such a way that they finish processing at the same time. Thus, from (5), we have

$$\delta L_i = \frac{(L_1 + \delta L_1) E_1 + \delta L_1 C_1 - L_i E_i}{C_i + E_i} > 0, \quad i = 2, \ldots, K. \qquad (16)$$

Hence the proof.

Proof of Lemma 2  This can be seen directly from an inherent property of the proposed load distribution strategy. Assume that we fail to determine a near-optimal solution satisfying the condition $\delta L_i > 0$, $i = 1, 2, \ldots, m$, in all previous iterations from $K$ to $m-1$. The last iteration cannot fail, since from Lemma 1 we have $\delta L_m > 0$, and all the other processors become potential receivers, which have $\delta L_i > 0$, $i = 1, 2, \ldots, m-1$. Hence the proof.

Proof of Lemma 3  For the receiver set, from (5), we have

$$\delta L_i = \delta L_{i+1} + \frac{L_{i+1} E - L_i E}{C + E}, \quad i = 1, 2, \ldots, K-1. \qquad (17)$$

Since $L_i E \ge L_{i+1} E$, $i = 1, 2, \ldots, K-1$, from the above equation we immediately obtain $\delta L_i \le \delta L_{i+1}$, $i = 1, 2, \ldots, K-1$. For the sender set, from (6), we have

$$\delta L_i = \delta L_{i+1} + \frac{L_{i+1} E - L_i E}{C - E}, \quad i = K+1, \ldots, m-1. \qquad (18)$$

Note that $(C - E) < 0$ in the above equation, as per the assumptions mentioned in Sect. 2. Since $L_i E \ge L_{i+1} E$, $i = K+1, \ldots, m-1$, from the above equation we obtain $\delta L_i \ge \delta L_{i+1}$, $i = K+1, \ldots, m-1$. Hence the proof.
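As a quick numeric sanity check of the monotonicity argument above, the following sketch evaluates the recursions in (17) and (18) for an assumed homogeneous system. The parameter values (E, C, the load vector, and the seed increments for $\delta L_K$ and $\delta L_m$) are hypothetical and chosen only so that the stated orderings can be observed; they are not values from the paper.

```python
# Hypothetical homogeneous parameters with C < E, as assumed in Sect. 2.
E, C = 1.0, 0.2                       # sec/MB compute and link rates (assumed)
K = 3                                 # receivers p_1..p_K, senders p_{K+1}..p_m
L = [50.0, 49.0, 48.0, 47.0, 46.0]    # loads sorted so that L_i * E >= L_{i+1} * E
m = len(L)

# Receiver recursion (17): dL_i = dL_{i+1} + (L_{i+1} - L_i) * E / (C + E)
dL_recv = [0.0] * K
dL_recv[K - 1] = 5.0                  # assumed seed value for dL_K
for i in range(K - 2, -1, -1):
    dL_recv[i] = dL_recv[i + 1] + (L[i + 1] - L[i]) * E / (C + E)

# Sender recursion (18): dL_i = dL_{i+1} + (L_{i+1} - L_i) * E / (C - E)
dL_send = [0.0] * m
dL_send[m - 1] = 4.0                  # assumed seed value for dL_m
for i in range(m - 2, K - 1, -1):
    dL_send[i] = dL_send[i + 1] + (L[i + 1] - L[i]) * E / (C - E)

# Lemma 3 as reconstructed: increments are non-decreasing over the receivers
# and non-increasing over the senders.
assert all(dL_recv[i] <= dL_recv[i + 1] for i in range(K - 1))
assert all(dL_send[i] >= dL_send[i + 1] for i in range(K, m - 1))
print("receiver dL:", [round(x, 3) for x in dL_recv])
print("sender   dL:", [round(x, 3) for x in dL_send[K:]])
```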

Xiaolin Li is an Assistant Professor in the Computer Science Department at Oklahoma State University. His research interests include Parallel and Distributed Systems, Cyber-Physical Systems, and Network Security.
His research has been sponsored by several external grants, including US National Science Foundation (NSF) grants (PetaApps, GENI, CRI, and MRI programs), the Department of Homeland Security (DHS), the Oklahoma Center for the Advancement of Science and Technology (OCAST), the Oklahoma Transportation Center (OTC), and industry partners. He is an associate editor of three international journals and a program chair for over 10 international conferences and workshops. He is on the executive committee of the IEEE Technical Committee on Scalable Computing (TCSC) and a panelist for NSF. He has been a TPC member for numerous international conferences, including INFOCOM, GlobeCom, ICC, CCGrid, MASS, and ICPADS. He received the Ph.D. degree in Communications and Information Engineering from the National University of Singapore, Singapore, and the Ph.D. degree in Computer Engineering from Rutgers University, USA. He directs the Scalable Software Systems Laboratory. He is a member of IEEE and ACM.

Bharadwaj Veeravalli received his B.Sc. in Physics from Madurai-Kamaraj University, India, in 1987, his Master's in Electrical Communication Engineering from the Indian Institute of Science, Bangalore, India, in 1991, and his Ph.D. from the Department of Aerospace Engineering, Indian Institute of Science, Bangalore, India. He did his post-doctoral research in the Department of Computer Science, Concordia University, Montreal, Canada. He is with the Department of Electrical and Computer Engineering, Communications and Information Engineering (CIE) division, at The National University of Singapore, as a tenured Associate Professor. His mainstream research interests include multiprocessor systems, cluster/grid/cloud computing, scheduling in parallel and distributed systems, bioinformatics and computational biology, and multimedia computing. He is one of the earliest researchers in the field of divisible load theory (DLT). He has published over 65 papers in high-quality international journals and conferences. He has successfully secured several externally funded projects and has co-authored three research monographs in the areas of PDS, Distributed Databases (competitive algorithms), and Networked Multimedia Systems, in the years 1996, 2003, and 2005, respectively. He has guest-edited a special issue on Cluster/Grid Computing for the IJCA, USA, journal. He has served as a program committee member and as a session chair in several international conferences. He is currently serving on the editorial boards of IEEE Transactions on Computers, IEEE Transactions on SMC-A, and Multimedia Tools & Applications (MTAP), USA, as an Associate Editor. He is a Senior Member of IEEE & IEEE-CS. Bharadwaj Veeravalli's complete academic career profile is available online.
