Scheduling Independent Tasks in Heterogeneous Environments under Communication Constraints

Petros Lampsas (1), Thanasis Loukopoulos (2), Fedon Dimopoulos (1), Marouso Athanasiou (1)

(1) Department of Informatics and Computer Technology, Technological Educational Institute of Lamia, 3rd km Old National Road, 35100 Lamia, Greece, {plam, cirrus, maryse}@inf.teilam.gr
(2) Department of Computer and Communications Eng., University of Thessaly, Glavani 37, 38221 Volos, Greece, luke@inf.uth.gr

Abstract

With the advent of the Grid, task scheduling in heterogeneous environments becomes more and more important. Of particular interest is the fact that, especially in scientific experiments, a non-negligible amount of data must be transferred to the processing node before a task can commence execution. Given bandwidth constraints, scheduling both computations and data transfers is required. In this paper we first develop a suitable model that captures heterogeneity in the processing nodes while imposing communication constraints. We proceed by proposing scheduling heuristics with the aim of minimizing the total makespan of a set of independent tasks. Through a series of experiments we illustrate the potential of a particular heuristic that is based on backfilling.

1. Introduction

The task scheduling problem is of paramount importance in the fields of parallel systems and computational Grids [14]. A generic description of it is as follows: given a set of tasks and a set of processing nodes, map the tasks to the nodes and order their execution, in order to minimize the total time required to execute all tasks. An extensive literature exists on the problem, investigating different aspects of it. Typically, in parallel systems processing nodes are assumed to be homogeneous [4], while in cluster and Grid computing the focus is usually on heterogeneous nodes [2]. Another distinguishing factor is whether tasks are assumed to have interdependences or not (Directed Acyclic Graph - DAG scheduling [7]). In this paper we deal with scheduling independent tasks in a heterogeneous distributed environment, with the main focus being computational Grids.
Our model captures heterogeneity in two ways. The first regards the processing power of each node and the different resource levels available at them. The second concerns communication capabilities. Contrary to most of the existing works, we place particular emphasis on the latter aspect since the data needed to be transferred for task execution in a computational Grid are often extremely large. Traditional formulations assume that all data transfers may commence at the same time, while the transfer delay is always linear to the available bandwidth and the size of the transferred data. This is less than adequate since it fails to capture congestion issues [13]. Instead, we follow an alternative whereby the number of transfers each node may participate in is restricted.

Our contributions are threefold. First, we develop a scheduling model that incorporates the main parameters of a distributed heterogeneous environment, i.e., node processing power, resource and communication constraints. Second, we adapt in our model two popular list scheduling heuristics [2] (MaxMin, MinMin). In doing so, we investigate different candidates for the task weight function. Finally, based on how MaxMin and MinMin work, we develop a new heuristic called BackfilledMaxMin (BMaxMin) that combines the strengths of both algorithms.

The rest of the paper is organized as follows. Section 2 presents the scheduling model. Section 3 illustrates the heuristics, which are experimentally evaluated in Section 4. Related work and directions for future research are discussed in Section 5. Finally, Section 6 concludes the paper.

2. Scheduling model

Assume there exist $M$ independent tasks (denoted by $T_k$) that must be scheduled at the $N$ nodes (let $N_i$) of a distributed system. The execution of a task $T_k$ at $N_i$ requires certain data (let $d_k$ be the size), as well as a certain amount of resources (e.g., processors, memory etc.) to be available at $N_i$. Let $r$ denote the total number of distinct resources we are interested in, $R_{ix}$ the level of the $x$th resource at $N_i$ and $r_{kx}$ the amount of the $x$th resource required by $T_k$ (meaning that in case $T_k$ commences its execution at $N_i$ the remaining resource level $R_{ix}$ at $N_i$ will be decreased by $r_{kx}$ for each distinct resource $x$).

Task data originate from sources which are connected with the processing nodes through a communication network. Let $S$ be an $N \times M$ matrix whereby an element $S_{ik} = 1$ iff $N_i$ is a source for $T_k$'s data and 0 otherwise. In case a task's data source is different from the processing node it is assigned to, the corresponding data need to be transferred. We assume that a node $N_i$ can participate in at most $K_i$ simultaneous transfers. Each of the $K_i$ transfers is performed at a rate $a_{ij}$ (where $a_{ij}$ is the time required to transfer 1 data unit from $N_j$ to $N_i$ and $a_{ij} = a_{ji}$). Notice that our approach in modeling transfer delays is similar to the ones in [1] and [11].

Figure 1. An example of 4 data transfers.

In case both the source $N_j$ and the destination $N_i$ of a data transfer for $T_k$ have not reached their capacities ($K_j$ and $K_i$ respectively), the data transfer will be completed after $a_{ij} d_k$ time. Otherwise, we must wait until one of the ongoing transfers is completed before $T_k$'s data transfer can commence. Fig. 1 illustrates the concept assuming a node $N_i$ must perform 4 transfers from a source $N_j$ that has unlimited transfer capacity. Fig. 1(a) shows the case where $K_i = 4$ and thus all transfers can commence simultaneously. Fig. 1(b) and 1(c) depict two possible transfer schedules when $K_i = 2$, of which the one in Fig. 1(c) is optimal. Once data has arrived at the intended node, task execution will start, provided the required resources are available.
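The effect of the transfer limit can be illustrated with a minimal Python sketch. It assumes a single destination node with $K_i$ channels and a source with unlimited capacity, as in Fig. 1; the function name `transfer_finish_times`, the rate and the data sizes are hypothetical and chosen only for illustration, not taken from the figure.

```python
import heapq

def transfer_finish_times(durations, k):
    """Finish time of each transfer when at most k of them run concurrently.

    Transfers are started in list order; each takes the channel that frees first.
    Only the destination's limit is modelled (the source is assumed unlimited).
    """
    channels = [0.0] * k                      # time at which each channel becomes free
    heapq.heapify(channels)
    finish = []
    for duration in durations:
        start = heapq.heappop(channels)       # earliest available channel
        end = start + duration
        finish.append(end)
        heapq.heappush(channels, end)
    return finish

a_ij = 2.0                                    # hypothetical time per data unit
sizes = [3, 1, 2, 2]                          # hypothetical data sizes d_k
durations = [a_ij * d for d in sizes]         # each transfer lasts a_ij * d_k

print(max(transfer_finish_times(durations, k=4)))                         # all start at once
print(max(transfer_finish_times(durations, k=2)))                         # given order
print(max(transfer_finish_times(sorted(durations, reverse=True), k=2)))   # longest first
```

With these sample values the last transfer completes at 6.0 time units when $K_i = 4$, at 10.0 when $K_i = 2$ and the transfers are taken in the given order, and at 8.0 when $K_i = 2$ and the longest transfers are started first. The order in which transfers are issued therefore matters once the channel limit is reached.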
We assume that an estimation of the execution time (let $e_k$) for each $T_k$ is readily available. The actual running time of a task execution depends on the node where it is assigned and is given by $e_k / speedup(N_i)$, where $speedup(N_i)$ depicts how much faster the tasks will execute at a node $N_i$ as compared to the base system on which the estimation of $e_k$ was based. Similar assumptions to capture heterogeneity were used in [2], [7], [9], [10] to name a few. Our goal is to schedule the $M$ tasks to the available nodes in order to minimize the total makespan. It can be shown that the relevant decision problem is NP-complete (by reduction from the two-processor scheduling problem [6]). In the sequel we present various heuristics to tackle it.

3. Scheduling heuristics

The heuristics examined here fall into two categories: (i) list-based [2], (ii) backfill-based [5], [16]. List-based heuristics order the tasks according to a weight function and schedule them either in increasing or decreasing order. In this category we examine the MaxMin and MinMin algorithms. The idea behind backfilling is to fill gaps in the schedule using small tasks. We discuss how to implement this premise for the particular case of MaxMin (BackfilledMaxMin heuristic).

MaxMin heuristic: In the MaxMin heuristic [2] we first select the maximum weight task (Max step) and schedule it to the node where its completion time will be minimum (Min step). We proceed by greedily scheduling the remaining tasks in decreasing order of weight. Notice that calculating the completion time of a task execution involves selecting among (possibly) multiple sources to download the relevant data. The source that minimizes the transfer completion time is selected, which is not necessarily the one with the minimum $a_{ij}$, since that source might already be occupied with other transfers. The rationale behind MaxMin is that since the largest tasks dominate the total makespan, scheduling them at the earliest possible time will likely minimize the overall completion time. Deciding task weights requires further attention.
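As a rough illustration of the list-based approach, the following Python sketch orders the tasks by weight and places each one on the node that minimizes its completion time. It is a simplified reading of the description above rather than the evaluated implementation: each node is assumed to run one task at a time, there is a single data source, and the limit on simultaneous transfers is ignored. The names `list_schedule`, `tasks`, `nodes` and all numeric values are hypothetical.

```python
def list_schedule(tasks, nodes, weight, largest_first=True):
    """Greedy list scheduling.

    tasks: name -> (d_k, e_k); nodes: name -> (time per data unit, speedup).
    Simplifications (not from the paper): one task at a time per node, a single
    data source, and no limit on simultaneous transfers.
    """
    free_at = {n: 0.0 for n in nodes}         # when each node can start executing
    placement = {}
    order = sorted(tasks, key=lambda k: weight(*tasks[k]), reverse=largest_first)
    for k in order:
        d_k, e_k = tasks[k]
        best_node, best_finish = None, float("inf")
        for n, (rate, speedup) in nodes.items():
            data_ready = rate * d_k           # transfer starts at time 0
            start = max(free_at[n], data_ready)
            finish = start + e_k / speedup    # running time is e_k / speedup(N_i)
            if finish < best_finish:          # Min step: earliest completion wins
                best_node, best_finish = n, finish
        free_at[best_node] = best_finish
        placement[k] = (best_node, best_finish)
    return placement, max(free_at.values())

tasks = {"T1": (4, 8), "T2": (1, 2), "T3": (2, 4)}     # hypothetical (d_k, e_k)
nodes = {"N1": (1.0, 1.0), "N2": (1.0, 2.0)}           # hypothetical (rate, speedup)
data_size = lambda d_k, e_k: d_k                       # weight = data size

print(list_schedule(tasks, nodes, data_size, largest_first=True))
print(list_schedule(tasks, nodes, data_size, largest_first=False))
```

Setting `largest_first=False` gives the increasing-order variant of the same list scheduler, so the two orderings can be compared on identical inputs.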
In other related problem domains, e.g., cluster scheduling [2], task weights (let $W_k$) usually equal estimated execution times ($e_k$'s). Thus, weights are monotone, in the sense that whenever two tasks $T_k$ and $T_{k'}$ are considered for scheduling at the same node and $W_k \le W_{k'}$, $T_k$ will finish earlier than $T_{k'}$ (regardless of the inspected node). However, when data transfers are considered, finding a monotonic weight definition is impossible, since transfer time depends on the source-destination pair. Thus, it might be the case that $T_k$ will execute faster than $T_{k'}$ at a node $N_i$, while at a node $N_{i'}$ the opposite can happen, if $N_{i'}$ is closer to the sources of $T_{k'}$ than to those of $T_k$. The following weight functions are considered in the paper:

$W^1_k = d_k$, i.e., the data size.
$W^2_k = d_k \left( \sum_i \sum_j S_{jk} (a_{ij} / N) \right) / \sum_j S_{jk}$, i.e., the average transfer time calculated by taking all possible sources and destinations of $T_k$.
$W^3_k = e_k$, i.e., the expected execution time.
$W^4_k = e_k / \left( \sum_i speedup(N_i) / N \right)$, i.e., the average execution time.
$W^5_k = W^2_k + W^4_k$, i.e., the average transfer time together with the average execution time.

MinMin heuristic: The MinMin heuristic [2] is symmetric to MaxMin, with the only difference being that tasks are scheduled in increasing order of weight (i.e., smallest is picked first). The rationale behind MinMin is that by minimizing the average start time of tasks the overall makespan will be minimal.

BackfilledMaxMin (BMaxMin) heuristic: This heuristic aims at improving the performance of MaxMin through better resource utilization. Recall that MaxMin selects the maximum weight task and assigns it to the most beneficial node. Regardless of the weight definition, tasks involving large data transfers will most likely be high in the rank. This means that the early stages will be devoted to data transfers, leaving the processing nodes otherwise idle. BMaxMin attempts to overcome the above shortcoming by only using a portion of the $K_i$ available channels for large data transfers and the rest for relatively small ones.
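The five weight functions considered in this section can be computed as in the following Python sketch. It assumes that the sums in $W^2_k$ and $W^4_k$ run over the $N$ processing nodes $i$ and the candidate sources $j$; the function name `weights`, the argument layout (`a[i][j]` indexed by processing node and source) and the sample values are hypothetical.

```python
def weights(d_k, e_k, sources_k, a, speedups):
    """Return (W1, W2, W3, W4, W5) for a single task T_k.

    d_k: data size; e_k: estimated execution time;
    sources_k: 0/1 flags S_jk, one per candidate source j;
    a: a[i][j] = time to move one data unit from source j to node i;
    speedups: speedup(N_i), one per processing node i.
    """
    n = len(speedups)                          # number of processing nodes N
    n_sources = sum(sources_k)                 # sum_j S_jk
    # average time per data unit over all destinations and actual sources
    avg_rate = sum(sources_k[j] * a[i][j] / n
                   for i in range(n)
                   for j in range(len(sources_k))) / n_sources
    w1 = d_k                                   # data size
    w2 = d_k * avg_rate                        # average transfer time
    w3 = e_k                                   # expected execution time
    w4 = e_k / (sum(speedups) / n)             # average execution time
    w5 = w2 + w4                               # transfer plus execution
    return w1, w2, w3, w4, w5

# Hypothetical example: 2 processing nodes, 3 candidate sources, of which
# sources 0 and 2 hold the data of the task.
a = [[1, 2, 3],
     [3, 2, 1]]
print(weights(d_k=100, e_k=300, sources_k=[1, 0, 1], a=a, speedups=[1.0, 2.0]))
```

For this sample task the average cost per data unit is 2.0 and the average speedup is 1.5, so both the average transfer time and the average execution time come out at 200.0, giving 400.0 for the combined weight.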
Following is the pseudocode of the algorithm:

BMaxMin()
(1) while (there exist unscheduled tasks)
(2)     schedule a task with MaxMin; //let T_k, N_i be the task-node pair decided by MaxMin
(3)     for all unscheduled tasks T_x //backfill phase
(4)         if (execution of T_x at N_i does not conflict with T_k's execution)
(5)             schedule T_x at N_i;

The algorithm starts by assigning a task using MaxMin, i.e., the task of largest weight (let $T_k$) to the node where it will be completed in the minimum time (let $N_i$). Lines 3-5 are the backfill phase, whereby the algorithm checks if any of the remaining tasks can be assigned to $N_i$ without resulting in a resource conflict with $T_k$. In Sec. 4 we experiment with two variants of BMaxMin. The first picks the tasks for backfilling starting from the bottom of the ordered list (i.e., smallest first). We call this variant BMaxMin-Up (BMaxMin-U). The other works in a reverse manner by starting from the top of the list and scanning downwards (BMaxMin-Down, BMaxMin-D for short).

Figure 2. An example of 4 task assignments.

Fig. 2 illustrates an example run of the three heuristics, assuming one processing node $N_i$ with $K_i = 2$ that can only execute one task at a time and one source $N_j$ with $K_j = 4$. The weight function used is $W^1_k$. Fig. 2(a) shows the time required for data transfer and execution by each task. Fig. 2(b) and 2(c) depict the schedules produced by MaxMin and MinMin. Notice the performance difference between the two, which can be attributed to the fact that MinMin utilizes the available node resources much earlier compared to MaxMin (Sec. 4 further elaborates the concept). Fig. 2(d) presents the schedule of BMaxMin-D. The algorithm first assigns the largest task ($T_1$) and enters the backfill phase. Since the list is scanned top-down, $T_2$ is first considered for backfill. However, the execution of $T_2$ conflicts with $T_1$ (recall that only one task can be executed each time). Therefore $T_2$ is not backfilled and the algorithm proceeds by checking $T_3$. $T_3$ causes no conflicts and is thus scheduled. The same holds true with $T_4$, which is scheduled after $T_3$.
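The pseudocode above can be turned into a small runnable Python sketch under explicit assumptions. Here each node executes one task at a time and has $K_i$ transfer channels, the source is unlimited, and a task is taken to "not conflict" if it finishes executing before $T_k$ starts executing; this conflict test is one possible interpretation, not a definition given in the text. The names `Node`, `earliest_slot`, `bmaxmin` and the task values are hypothetical and are not the values of Fig. 2.

```python
def earliest_slot(busy, ready, dur):
    """Earliest start >= ready so that [start, start + dur) avoids busy intervals."""
    start = ready
    for s, f in sorted(busy):
        if start + dur <= s:                  # fits in the gap before this interval
            break
        start = max(start, f)
    return start

class Node:
    def __init__(self, k, rate, speedup):
        self.channels = [0.0] * k             # time each transfer channel frees up
        self.rate, self.speedup = rate, speedup
        self.busy = []                        # execution intervals (start, finish)

    def plan(self, d, e):
        """Tentative (channel, data ready, exec start, exec finish) for a task."""
        c = min(range(len(self.channels)), key=self.channels.__getitem__)
        ready = self.channels[c] + self.rate * d
        dur = e / self.speedup
        start = earliest_slot(self.busy, ready, dur)
        return c, ready, start, start + dur

    def commit(self, c, ready, start, finish):
        self.channels[c] = ready              # channel is held until the data arrives
        self.busy.append((start, finish))

def bmaxmin(tasks, nodes, weight, smallest_first=True):
    """tasks: name -> (d_k, e_k). Returns name -> (node index, start, finish)."""
    pending = sorted(tasks, key=lambda k: weight(*tasks[k]), reverse=True)
    out = {}
    while pending:                                              # line 1
        k = pending.pop(0)                                      # Max step
        plans = [n.plan(*tasks[k]) for n in nodes]
        i = min(range(len(nodes)), key=lambda j: plans[j][3])   # Min step (line 2)
        nodes[i].commit(*plans[i])
        k_start = plans[i][2]
        out[k] = (i, plans[i][2], plans[i][3])
        scan = list(reversed(pending)) if smallest_first else list(pending)
        for x in scan:                                          # lines 3-5: backfill
            p = nodes[i].plan(*tasks[x])
            if p[3] <= k_start:               # assumed test: done before T_k runs
                nodes[i].commit(*p)
                out[x] = (i, p[2], p[3])
                pending.remove(x)
    return out

tasks = {"T1": (8, 2), "T2": (6, 3), "T3": (2, 2), "T4": (2, 2)}   # hypothetical
print(bmaxmin(tasks, [Node(k=2, rate=1.0, speedup=1.0)],
              weight=lambda d, e: d, smallest_first=False))        # BMaxMin-D scan
```

With `smallest_first=False` the backfill scan runs top-down as in BMaxMin-D. On this sample input `T2` is rejected for backfill, `T3` and `T4` execute while `T1` is still transferring, and `T2` runs only after `T1` has finished.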
The algorithm then ends the backfill phase and returns to line 2 in the pseudocode, where $T_2$ is assigned. Notice that $T_2$'s execution must be delayed until after $T_1$ finishes.

4. Experiments

In this section we present some experimental results. Sec. 4.1 describes the simulation setup, while Sec. 4.2 illustrates our findings. Due to space limitations only a small set of the experiments conducted is shown.

4.1. Simulation setup

We assumed a network consisting of 20 processing nodes and 5 source nodes (i.e., nodes that only serve as data repositories and perform no computations). The average transfer rate between the processing nodes and the sources (i.e., the $a_{ij}$'s) varied uniformly between 1 and 3, while the speedup at the processing nodes varied between 1 and 2. The number of simultaneous transfers a processing node $N_i$ could participate in was fixed to $K_i = 4$, while for sources it was set to infinity. This was done in order to simulate the case where the available bandwidth at the processing nodes is the restrictive factor for performing the data transfers.

Figure 3. Effects of different weight metrics on the total makespan. [Plot: total makespan (axis range 60000 to 72000) per weight metric (data size, avg transfer time, estimated execution time, avg execution time, avg transfer+execution) for MinMin, MaxMin, BMaxMin-D and BMaxMin-U.]

We generated a total of 500 tasks with the following properties: (i) data size ($d_k$) varied uniformly between 500 and 5,000 data units, (ii) estimated execution time ($e_k$) varied between 1,500 and 15,000 sec. We selected the above values in order to equalize (on the average case) the data transfer time and the processing time of a single task (notice the way speedups and transfer rates were defined). Unless otherwise stated we assumed that each node $N_i$ can execute in parallel at most two tasks due to resource limitations (i.e., the $R_{ix}$'s). Finally, one randomly allocated copy of the data required to execute a task was available.

4.2. Results

First, we evaluated the performance of the algorithms under the different weight functions presented in Sec. 3. Fig. 3 illustrates the results. The first thing to notice is that the effect of the weight metric is not identical for all algorithms.
MaxMin and BMaxMin-D behave best when the weight metric is strongly related to the execution time. MinMin on the other hand achieves a better performance when the transfer cost is factored in the weight, while the total makespan of BMaxMin-U tends to remain unaffected. In the remaining experiments we adopted the data size as the weight metric.

Figure 4. Total makespan as the size of data transfers increases. [Plot: total makespan (axis range 0 to 100000) versus data size increase (x1, x2, x5, x10) for MinMin, MaxMin, BMaxMin-D and BMaxMin-U.]

Next we evaluated the algorithms as the size of data transfers increases. Fig. 4 displays the performance as the average transfer time changes from being equal to the average execution time (x1) towards being 10 times more costly (x10). Notice that the makespan increases rather slowly compared to the increase in data transfer sizes. This indicates that the processing time still dominates the makespan, which is expected since only two tasks can execute in parallel, while each node can fetch the data of 4 tasks simultaneously ($K_i = 4$). MaxMin and BMaxMin-U outperform their counterparts in most cases (with the latter having a small edge over the former). This is due to the fact that MaxMin prioritizes large transfers. However, in doing so it keeps the processing nodes idle at the beginning of the schedule. MinMin on the other hand tends to utilize the nodes (for processing) early on, with the drawback being that large transfers are left towards the end of the schedule and are thus not sent to the best possible nodes. BMaxMin-U combines the benefits of both approaches, winning in most cases. Notice also that BMaxMin-D does not achieve performance comparable to BMaxMin-U, due to the fact that fewer tasks are backfilled. Similar observations can be drawn in Fig. 5, which demonstrates the total makespan as the estimated running times of the tasks increase up to 10 times their basic value. In this case MinMin together with BMaxMin-U offer the best performance.

Figure 5. Total makespan as the estimated execution time increases. [Plot: total makespan (axis range 0 to 800000) versus estimated execution time increase (x1, x2, x5, x10) for MinMin, MaxMin, BMaxMin-D and BMaxMin-U.]

In the last experiment we varied the number of tasks that can be executed in parallel at each node. Fig. 6 illustrates the performance when the degree of parallelism at each node (due to resource constraints) increases from 1 (i.e., no parallel executions) to 4. Notice that BMaxMin-U together with MinMin offers again the best makespan.

Figure 6. Total makespan as more tasks can be executed in parallel at each node. [Plot: total makespan (axis range 0 to 140000) versus number of tasks that can be executed in parallel (1, 2, 4) for MinMin, MaxMin, BMaxMin-D and BMaxMin-U.]

Overall, from the experiments we can conclude that MaxMin achieves better performance when the workload is data-intensive, while MinMin outperforms MaxMin if the workload is computation-intensive. BMaxMin-U combines the strengths of both algorithms, outperforming them in most cases, while the BMaxMin-D variant is inferior. In more extensive experimentations (not shown due to lack of space) we further checked the above conclusions and found that although in most cases they are true, whenever the setup was primarily bandwidth constrained (and not resource constrained as the one here) the benefits from BMaxMin-U tend to diminish. In future work we plan to investigate other variants of BMaxMin that do not use all available channels for backfill but rather follow a more judicious approach.

5. Related work

Task scheduling in heterogeneous environments has seen significant research effort (see [4] and [7] for surveys). In [2] the authors evaluated 11 heuristics for scheduling independent tasks requiring no data transfers.
MinMin together with a Genetic Algorithm (GA) were shown to achieve the best performance. For the case of interacting tasks a heuristic based on the cross-entropy method called MaTCH was proposed in [12] and was found to outperform GA. Mapping was also the goal of [8], where the authors propose a Web service like architecture to dynamically map independent jobs to more than one type of resources. Various papers focused on data intensive applications, e.g., [3], [9] and [15] to name a few. In [3] an integer programming approximation was experimentally proven to outperform a greedy algorithm (resembling MaxMin) whenever data transfers were constrained by the available storage space. In [9] a GA was found to outperform greedy algorithms based on FIFO scheduling. For the case where multiple data files must be acquired from various sources before a task can commence execution, a heuristic based on the set covering problem that prioritizes transfers over computations achieves a good trade-off between running time and solution quality, as shown in [15]. The aforementioned works are perhaps the closest to ours since they too aim at scheduling both computations and data transfers. However, we differ significantly both in the adopted model (e.g., [3] assumes that data transfer delay is due to processing capacity at sources, while [9] and [15] do not consider any resource constraints on the processing nodes) and in the heuristics, since we identify the potential of scheduling a mix of large and small data transfers and realize it by combining the backfilling technique [5], [16] with MaxMin. Two other alternatives to scheduling data and computations are described in [1] and [11]. In [11] data transfers are completely decoupled from the scheduling process and are performed in a preemptive manner according to dataset popularity, while in [1] the authors provide optimal solutions to different variations of the scheduling problem under the assumption that tasks can be arbitrarily divided. Our work differs in scope from both these works. As future research we plan to investigate the case of DAG scheduling [7], which has experienced a renewed interest under the context of scheduling workflow-based applications [8], [10].

6. Conclusions

In this paper we discussed the scheduling of independent tasks in a heterogeneous environment involving data transfers. The model we adopted captures all the main parameters due to heterogeneity while imposing a constraint on the number of transfers that can be conducted simultaneously by a single node. We developed algorithms with the goal of minimizing the total makespan. Among them, the backfill variant of MaxMin was found to achieve the best performance in most cases. In an extended version we plan to investigate further improvements on BMaxMin, as well as take advantage of the strengths of Genetic Algorithms.

7. References

[1] O. Beaumont, A. Legrand and Y. Robert, Optimal Algorithms for Scheduling Divisible Workloads on Heterogeneous Systems, in Proc. 17th Int. Parallel and Distributed Processing Symp. (IPDPS 2003), Nice, France, 2003.
[2] T. Braun, H. Siegel, N. Beck, L. Boloni, M. Maheswaran, A. Reuther, J. Robertson, M. Theys and B. Yao, A Comparison of Eleven Static Heuristics for Mapping a Class of Independent Tasks onto Heterogeneous Distributed Computing Systems, in Journal of Parallel and Distributed Computing (JPDC), Vol. 61(6), pp. 810-837, June 2001.
[3] F. Desprez and A. Vernois, Simultaneous Scheduling of Replication and Computation for Data-Intensive Applications on the Grid, Research Report RR2005-01, INRIA, France, Jan. 2005.
[4] D. Feitelson, L. Rudolph and U. Schwiegelshohn, Parallel job scheduling - a status report, in Proc. 10th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 04), pp. 1-16, June 2004.
[5] E. Frachtenberg, D. Feitelson, F. Petrini and J. Fernandez, Adaptive Parallel Job Scheduling with Flexible CoScheduling, in IEEE Trans. on Parallel and Distributed Systems (TPDS), Vol. 16(11), pp. 1066-1077, Nov. 2005.
[6] M. R. Garey and D. S. Johnson, Computers and Intractability, a Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[7] Yu-K. Kwok and I. Ahmad, Static scheduling algorithms for allocating directed task graphs to multiprocessors, in ACM Computing Surveys, Vol. 31(4), pp. 406-471, 1999.
[8] G. Malewicz, A. Rosenberg and M. Yurkewych, On Scheduling Complex Dags for Internet-Based Computing, in Proc. 19th Int. Parallel and Distributed Processing Symp. (IPDPS 2005), Denver, Colorado, USA, April 2005.
[9] T. Phan, K. Ranganathan and R. Sion, Evolving Toward the Perfect Schedule: Co-Scheduling Job Assignments and Data Replication in Wide-Area Systems Using a Genetic Algorithm, in Proc. 11th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2005), June 19, 2005.
[10] R. Prodan and T. Fahringer, Dynamic Scheduling of Scientific Workflow Applications on the Grid: A Case Study, in Proc. ACM Symp. on Applied Computing (SAC 05), pp. 687-694, Santa Fe, New Mexico, USA, March 2005.
[11] K. Ranganathan and I. Foster, Decoupling Computation and Data Scheduling in Distributed Data-Intensive Applications, in Proc. 11th IEEE Symposium on High Performance Distributed Computing (HPDC 02), Edinburgh, Scotland, July 2002.
[12] S. Sanyal and S. K. Das, MaTCH: Mapping Data-Parallel Tasks on a Heterogeneous Computing Platform using the Cross-Entropy Heuristic, in Proc. 19th Int. Parallel and Distributed Processing Symp. (IPDPS 2005), Denver, Colorado, USA, April 2005.
[13] O. Sinnen, L. Sousa and F. Sandnes, Toward a Realistic Task Scheduling Model, in IEEE Trans. on Parallel and Distributed Systems (TPDS), Vol. 17(3), pp. 263-275, March 2006.
[14] Xian-He Sun and Ming Wu, GHS: A Performance System of Grid Computing, in Proc. 19th Int. Parallel and Distributed Processing Symp. (IPDPS 2005), Denver, Colorado, USA, April 2005.
[15] S. Venugopal and R. Buyya, A Set Coverage-based Mapping Heuristic for Scheduling Distributed Data-Intensive Applications on Global Grids, Technical Report, GRIDS-TR-2006-3, Grid Computing and Distributed Systems Laboratory, The University of Melbourne, Australia, March 8, 2006.
[16] Y. Zhang, H. Franke, J. E. Moreira, and A. Sivasubramaniam, Improving parallel job scheduling by combining gang scheduling and backfilling techniques, in Proc. 14th Int. Parallel and Distributed Processing Symp. (IPDPS 2000), pp. 133-142, Cancun, May 2000.

Acknowledgements

This work was partially supported by the European Social Fund and National Resources - (EPEAEK-II) ARCHIMIDIS II.