Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions


CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 0000; 00:1-20. Published online in Wiley InterScience.

Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions

Rafael Ferreira da Silva 1,2, Tristan Glatard 1,3, Frédéric Desprez 4

1 University of Lyon, CNRS, INSERM, CREATIS, Villeurbanne, France
2 University of Southern California, Information Sciences Institute, Marina Del Rey, CA, USA
3 McConnell Brain Imaging Centre, Montreal Neurological Institute, McGill University, Canada
4 INRIA, University of Lyon, LIP, ENS Lyon, Lyon, France

Correspondence to: 4676 Admiralty Way, Suite 1001, 90292, Marina del Rey, CA, USA. rafsilva@isi.edu

SUMMARY

Distributed computing infrastructures are commonly used for scientific computing, and science gateways provide complete middleware stacks to allow their transparent exploitation by end-users. However, administrating such systems manually is time-consuming and sub-optimal due to the complexity of the execution conditions. Algorithms and frameworks aiming at automating system administration must deal with online and non-clairvoyant conditions, where most parameters are unknown and evolve over time. We consider the problem of controlling task granularity and fairness among scientific workflows executed in these conditions. We present two self-managing loops that monitor the fineness, coarseness, and fairness of workflow executions, compare these metrics to thresholds extracted from knowledge acquired in previous executions, and plan appropriate actions to keep these metrics within appropriate ranges. Experiments on the European Grid Infrastructure show that our task granularity control can speed up executions by up to a factor of 2, and that our fairness control reduces slowdown variability by a factor of 3 to 7 compared to first-come, first-served. We also study the interaction between granularity control and fairness control: our experiments demonstrate that controlling task granularity degrades fairness, but that our fairness control algorithm can compensate for this degradation.

KEY WORDS: Distributed computing infrastructures, task granularity, fairness, scientific workflows, online conditions, non-clairvoyant conditions.

1. INTRODUCTION

Distributed computing infrastructures (DCIs) such as grids and clouds are becoming daily instruments of scientific research, commonly exploited by science gateways [1] to offer a transparent service to end-users. However, the large scale of these infrastructures, their heterogeneity, and the complexity of their middleware stacks make them prone to software and hardware errors manifested by uncompleted or under-performing executions. Substantial system administration efforts need to be invested to improve quality of service, still at a non-optimal level. Autonomic computing [2], in particular self-management, refers to the set of methods and algorithms addressing this challenge with control loops based on Monitoring, Analysis, Planning, Execution and Knowledge (MAPE-K). These algorithms have to cope with online conditions, where the platform load and infrastructure status constantly change, and non-clairvoyant conditions, where most of the applications' and infrastructure's characteristics are unknown. The interactions between self-managing loops have to be properly studied to check that their individual objectives do not conflict.

This paper focuses on the control of task granularity and fairness among executions described as scientific workflows [3] in a science gateway. Task granularity is commonly optimized by grouping, or clustering, tasks of a workflow execution to minimize the impact of data transfers, task queuing times, and other overheads [4-13]. Group sizes have to be carefully tuned to avoid harmful parallelism losses. Fairness is achieved when resources are allocated to scientific workflow executions in proportion to their requirements rather than as a result of race conditions. A way to quantify the fairness obtained for a particular workflow execution is to determine its slowdown compared to an ideal situation where it would run alone on the whole infrastructure [14, 15]. Other fairness metrics can also be used, as for instance in [16].

We describe two self-managing loops to control task granularity and fairness of scientific workflows executed in online, non-clairvoyant conditions on DCIs. These algorithms periodically evaluate metrics estimating the coarseness, fineness, and fairness of workflow executions. When these metrics exceed thresholds determined from historical executions, pre-defined actions are implemented: (i) tasks are grouped or de-grouped to adjust granularity, and (ii) tasks are re-prioritized to improve fairness. These two loops were independently introduced and evaluated in [17] and [18]. In this work, we complement these previous presentations by studying the interaction between the two algorithms. Adjusting task granularity obviously impacts resource allocation, and therefore fairness among executions. We approach this issue from an experimental angle, testing the following hypotheses: H1: the granularity control loop reduces fairness among executions; H2: the fairness control loop avoids this reduction.

The paper is organized as follows. Section 2 reviews the work related to task grouping and fairness control. Section 3 describes the two self-managing loops to control task granularity and fairness. Section 4 presents experiments evaluating our two control processes independently, and their interaction. Experiments are performed in real conditions, on the production system of the European Grid Infrastructure (EGI) accessed through the Virtual Imaging Platform (VIP [19]).

2. RELATED WORK

Task grouping. The low performance of fine-grained tasks is a common problem in widely distributed platforms where the scheduling overhead and queuing times are high, such as grid and cloud systems. Several works have addressed the control of task granularity of bags of tasks. For instance, Muthuvelu et al. [4] proposed an algorithm to group bags of tasks based on their granularity size, defined as the processing time of the task on the resource. Resources are ordered by decreasing capacity (in MIPS) and tasks are grouped up to the resource capacity. This process continues until all tasks are grouped and assigned to resources. Keat et al. [5] and Ang et al. [6] extended the work of Muthuvelu et al. by introducing bandwidth in the scheduling framework to enhance the performance of task scheduling. Resources are sorted in decreasing order of bandwidth, then assigned to grouped tasks sorted in decreasing order of processing requirement length. The size of a grouped task is determined from the task cost in millions of instructions (MI).
Later, Muthuvelu et al. [7] extended [4] to determine task granularity based on QoS requirements, task file size, estimated task CPU time, and resource constraints. Meanwhile, Liu & Liao [8] proposed an adaptive fine-grained job scheduling algorithm (AFJS) to group lightweight tasks according to the processing capacity (in MIPS) and bandwidth (in Mb/s) of the currently available resources. Tasks are sorted in decreasing order of MI, then clustered by a greedy algorithm. To accommodate resource dynamicity, the grouping algorithm integrates monitoring information about the current availability and capability of resources. Afterwards, Soni et al. [9] proposed an algorithm (GBJS) to group lightweight tasks into coarse-grained tasks based on the processing capability, bandwidth, and memory size of the available resources. Tasks are sorted in ascending order of required computational power, then selected in first-come, first-served order to be grouped according to the capability of the resources.

Zomaya and Chan [10] studied limitations and ideal control parameters of task clustering by using genetic algorithms. Their algorithm performs task selection based on the earliest task start time and task communication costs; it converges to an optimal solution of the number of clusters and tasks per cluster. Although the reviewed works significantly reduce communication and processing time, none of them is both non-clairvoyant and online. Recently, Muthuvelu et al. [11, 12] proposed an online scheduling algorithm to determine the task granularity of compute-intensive bag-of-tasks applications. The granularity optimization is based on task processing requirements, a resource-network utilisation constraint (the maximum time a scheduler waits for data transfers), and users' QoS requirements (user's budget and application deadline). Submitted tasks are categorised according to their file sizes, estimated CPU times, and estimated output file sizes, and arranged in a tree structure. The scheduler selects a few tasks from these categories to perform resource benchmarking. Tasks are grouped according to seven objective functions of task granularity, and submitted to resources. The process restarts upon task arrival. In a collaborative work [13], we presented three balancing methods to address the load balancing problem when clustering scientific workflow tasks. We defined three imbalance metrics to quantitatively measure workflow characteristics based on task runtime variation (HRV), task impact factor (HIFV), and task distance variance (HDV). Although these are online approaches, the solutions are still clairvoyant.

Fairness. Fairness among scientific workflow executions has been addressed in several studies considering the scheduling of multiple scientific workflows but, to the best of our knowledge, no algorithm was proposed for the non-clairvoyant and online case. For instance, Zhao and Sakellariou [14] address fairness based on the slowdown of Directed Acyclic Graphs (DAGs); they consider a clairvoyant problem where the execution time and the amount of data transfers are known. Similarly, N'Takpé and Suter [20] propose a mapping procedure to increase fairness among parallel tasks on multi-cluster platforms; they address an offline and clairvoyant problem where tasks are scheduled according to one of the following three characteristics: critical path length, maximal exploitable task parallelism, or amount of work to execute. Casanova et al. [15] evaluate several online scheduling algorithms for multiple parallel task graphs (PTGs) on a single, homogeneous cluster. Fairness is measured through the maximum stretch (a.k.a. slowdown), defined as the ratio between the PTG execution time on a dedicated cluster and the PTG execution time in the presence of competition with other PTGs. Hsu et al. [21] propose an online HEFT-like algorithm to schedule multiple workflows; they address a clairvoyant problem where tasks are ranked based on the length of their critical path, and tasks are mapped to the resources with the earliest finish time. Sommerfeld and Richter [22] present a two-tier HEFT-based grid workflow scheduler with predictions of input-queue waiting times and task execution times; fairness among workflow tasks is addressed by preventing HEFT from assigning the highest ranks to the first tasks regardless of their originating workflows. Hirales-Carbajal et al. [23] schedule multiple parallel workflows on a grid in a non-clairvoyant but offline context, assuming dedicated resources.
Their multi-stage scheduling strategies consist of task labeling, adaptive allocation, local queue prioritization, and a site scheduling algorithm. Fairness among workflow tasks is achieved by task labeling based on task run time estimation. Recently, Arabnejad and Barbosa [24] proposed an algorithm addressing an online but clairvoyant problem where tasks are assigned to resources based on their rank values; the task rank is determined from the smallest remaining time among all remaining tasks of the workflow, and from the percentage of remaining tasks. Finally, in their evaluation of non-preemptive task scheduling, Sabin et al. [25] assess fairness by assigning a fair start time to each task, defined as the start time of the task in a complete simulation of all tasks whose queue time is lower than that of the considered task. Any task which starts after its fair start time is considered to have been treated unfairly. Results are trace-based simulations over a period of one month, but the study is performed in a clairvoyant context. Skowron and Rzadca [26] proposed an online and non-clairvoyant algorithm to schedule sequential jobs on distributed systems. They consider a non-clairvoyant model where a job's processing time is unknown until the job completes. However, they assume that resources are homogeneous, which is not the case in grid computing. In contrast, our method considers resource performance, the execution of concurrent activities, and task dependency in scientific workflow executions.

3. METHODS

Scientific workflows consist of activities linked by data and control dependencies. Activities are bound to an application executable. At runtime, they are iterated on the data to generate tasks. A workflow activity is said to be active if it has at least one waiting (queued) or running task. Tasks cannot be pre-empted, i.e. only tasks in the waiting status can be modified. More details about our workflow model are available in [27]. Tasks related to an activity are assumed independent from each other, yet with similar cost (bag of tasks): if these tasks were executed in the exact same conditions, they would be of identical duration.

The following sub-sections describe our task granularity and fairness control processes. In all the algorithms, η denotes the degree quantifying the controlled variable (fineness, coarseness, or unfairness), and τ denotes the threshold beyond which actions have to be triggered. Indexes refer to the controlled variable: f for fineness, c for coarseness, and u for unfairness. For instance, η_f is the fineness degree, and τ_u is the unfairness threshold.

3.1. Task Granularity Control Process

Algorithm 1 describes our task granularity control, composed of two processes: (i) the fineness control process groups too-fine task groups, for which the fineness degree η_f is greater than threshold τ_f, and (ii) the coarseness control process de-groups too-coarse task groups, for which the coarseness degree η_c is greater than threshold τ_c. This subsection describes how η_f, η_c, τ_f and τ_c are determined, and details the grouping and de-grouping algorithms.

Algorithm 1 Main loop for granularity control
1: input: n waiting tasks
2: create n 1-task groups T_i
3: while there is an active task group do
4:   wait for timeout or task status change
5:   determine fineness degree η_f
6:   if η_f > τ_f then
7:     group task groups using Algorithm 2
8:   end if
9:   determine coarseness degree η_c
10:  if η_c > τ_c then
11:    degroup coarsest task groups
12:  end if
13: end while
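For illustration, the control loop of Algorithm 1 boils down to the following minimal Python sketch. All names are ours (the actual implementation is a MOTEUR plugin reacting to task status changes), and the threshold values are those determined later in this section from execution traces.

TAU_F = 0.55   # fineness threshold, determined below from trace histograms
TAU_C = 0.5    # coarseness threshold (eta_c > 0.5 iff Q < R, see Eq. 3)

def granularity_step(eta_f, eta_c, group_tasks, degroup_tasks):
    """One iteration of Algorithm 1, run on timeout or task status change."""
    if eta_f > TAU_F:
        group_tasks()      # too fine: apply Algorithm 2
    if eta_c > TAU_C:
        degroup_tasks()    # too coarse: split the coarsest groups

# Example: a too-fine activity triggers grouping only.
granularity_step(0.7, 0.2,
                 group_tasks=lambda: print("grouping"),
                 degroup_tasks=lambda: print("de-grouping"))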

Measuring fineness: η_f. Let n be the number of waiting tasks in a scientific workflow activity, and m the number of task groups. Initially, one group is created for each task (n = m). T_i is the set of tasks in group i, and n_i is the number of tasks in T_i. Groups are a partition of the set of waiting tasks: T_i ∩ T_j = ∅ for i ≠ j, and Σ_{i=1}^{m} n_i = n. The activity fineness degree η_f is the maximum of all group fineness degrees f_i:

η_f = max_{i ∈ [1,m]} f_i.    (1)

High fineness degrees indicate fine granularities. We use a max operator in this equation to ensure that any task group with a too-fine granularity will be detected. The fineness degree f_i of group i is defined as:

f_i = d_i · r_i,    (2)

where r_i is the queuing ratio (described below), and d_i is the ratio between the transfer time of the input data shared among all tasks in the activity and the total execution time of the group:

d_i = t̃_shared / (t̃_shared + n_i · (t̃ − t̃_shared)),

where t̃_shared is the median transfer time of the input data shared among all tasks in the activity, and t̃ is the sum of its median task phase durations, corresponding to application setup, input data transfer, application execution, and output data transfer:

t̃ = t̃_setup + t̃_input + t̃_exec + t̃_output.

The median values t̃_shared and t̃ are computed from values measured on completed tasks. When fewer than 2 tasks are completed, the medians remain undefined and the control process is inactive. This online estimation makes our process non-clairvoyant with respect to the task duration, which is progressively estimated as the workflow activity runs. We are aware that using the median may be inaccurate. However, without a model of the application's execution time, we must rely on observed task durations. Using the whole time distribution (or at least its first few moments) might be more accurate, but it would complicate the method.

In equation 2, r_i is the ratio between the maximum of the current task queuing times q_j in the group (measured for each task individually) and the total round-trip time (queuing + execution) of the group:

r_i = max_{j ∈ [1,n_i]} q_j / (max_{j ∈ [1,n_i]} q_j + t̃_shared + n_i · (t̃ − t̃_shared)).

The group queuing time is the maximum of all task queuing times in the group, while the group execution time is the time to transfer the shared input data plus the time to execute all task phases in the group except the transfers of shared input data. Note that d_i, r_i, and therefore f_i and η_f are in [0, 1]. η_f tends to 0 when there is little shared input data among the activity tasks or when the task queuing times are low compared to the execution times; in both cases, grouping tasks is indeed useless. Conversely, η_f tends to 1 when the transfer time of shared input data becomes high and the queuing time is high compared to the execution time; grouping is needed in this case.

Thresholding fineness: τ_f. The threshold value for η_f separates configurations where the activity's fineness is acceptable (η_f ≤ τ_f) from configurations where the activity is too fine (η_f > τ_f). We determine τ_f from execution traces, inspecting the modes of the distribution of η_f. Values of η_f in the highest mode of the distribution, i.e. which are clearly separated from the others, will be considered too fine. Threshold values determined from traces are likely to depend on the execution conditions covered by the traces. In particular, traces have to be extracted from the platform where the methods will be implemented.

The traces were collected from VIP [19] between January 2011 and April 2012, and made available through the science-gateway workload archive [28]. The data set contains 680,988 tasks (including resubmissions and replications) linked to activities of 2,941 scientific workflows executed by 112 users; the average task waiting time is about 36 min. Applications deployed in VIP are described as scientific workflows executed using the MOTEUR workflow engine [29]. Resource provisioning and task scheduling are provided by DIRAC [30] using so-called pilot jobs. Resources are provisioned online with no advance reservations. Tasks are executed on the biomed virtual organization (VO) of the European Grid Infrastructure (EGI), which has access to some 150 computing sites worldwide and to 120 storage sites providing approximately 4 PB of storage. Figure 1 (left) shows the distribution of sites per country supporting the biomed VO.

The fineness degree η_f was computed after each event found in the data set. Figure 1 (right) shows the histogram of these values. The histogram appears bimodal, which indicates that η_f separates the platform configurations into two distinct groups, separated by the threshold value τ_f = 0.55. In our case, this threshold value is visually determined from the histogram; simple clustering methods such as k-means can be used to determine the threshold value automatically [31].
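As an illustration, the following self-contained Python sketch implements Eqs. (1) and (2). The function and variable names are ours, not those of the MOTEUR plugin, and the median estimates are assumed to be already computed from completed tasks.

def fineness_degree(groups, t_med, t_shared):
    """Activity fineness degree eta_f (Eq. 1).

    groups   : list of task groups, each given as the list of current
               queuing times of its waiting tasks
    t_med    : median total task duration (setup + input + exec + output)
    t_shared : median transfer time of the shared input data
    """
    def f(queue_times):                                   # Eq. 2: f_i = d_i * r_i
        n_i = len(queue_times)
        group_exec = t_shared + n_i * (t_med - t_shared)  # group execution time
        d_i = t_shared / group_exec                       # shared-input ratio
        q_max = max(queue_times)
        r_i = q_max / (q_max + group_exec)                # queuing ratio
        return d_i * r_i
    return max(f(g) for g in groups)

# Example: six 1-task groups, t_med = 10 and t_shared = 7 (arbitrary units),
# each task queued for 60 units: grouping would be triggered.
eta_f = fineness_degree([[60]] * 6, t_med=10, t_shared=7)
print(round(eta_f, 2), eta_f > 0.55)   # 0.6 True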

We consider that these groups correspond to acceptable fineness (lowest mode) and too-fine granularity (highest mode). For η_f > 0.55, task grouping will be triggered.

Figure 1. Distribution of sites and batch queues per country in the biomed VO (January 2013) (left) and histogram of the fineness degree η_f sampled in bins of 0.05 (right).

Task grouping. Algorithm 2 describes our task grouping mechanism. Groups with f_i > τ_f are grouped pairwise until η_f ≤ τ_f or until the number of waiting groups Q is smaller than or equal to the number of running groups R. Although η_f ignores scattering (Eq. 1 uses a max), the algorithm considers it by grouping tasks in all groups where f_i > τ_f. Ordering groups by decreasing f_i values tends to distribute tasks equally among groups. The grouping process stops when Q ≤ R to avoid parallelism loss. This condition also avoids conflicts with the de-grouping process described below.

Algorithm 2 Task grouping
1: input: f_1 to f_m // group fineness degrees, sorted in decreasing order
2: input: Q, R // number of queued and running task groups
3: for i = 1 to m − 1 do
4:   j = i + 1
5:   while f_i > τ_f and Q > R and j ≤ m do
6:     if f_j > τ_f then
7:       Group all tasks of T_j into T_i
8:       Recalculate f_i using Equation 2
9:       Q = Q − 1
10:    end if
11:    j = j + 1
12:  end while
13:  i = j
14: end for
15: Delete all empty task groups

Measuring coarseness: η_c. The condition Q > R used in Algorithm 2 ensures that all resources will be exploited if the number of available resources is stationary. If the number of available resources decreases, the fineness control process may further reduce the number of groups. However, if the number of available resources increases, task groups may need to be de-grouped to maximize resource exploitation. This de-grouping is implemented by our coarseness control process, which monitors the value of η_c defined as:

η_c = R / (Q + R).    (3)

The threshold value τ_c is set to 0.5 so that η_c > τ_c ⟺ Q < R. When an activity is considered too coarse, its groups are ordered by increasing values of f_i and the first groups (i.e. the coarsest ones) are split until η_c < τ_c. Note that de-grouping increases the number of queued tasks, and therefore tends to reduce η_c.
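The pairwise grouping of Algorithm 2 and the coarseness degree of Eq. (3) can be sketched as follows; this is a simplified, stand-alone Python version under our own naming, where groups are plain lists of tasks and the group fineness degree is supplied as a function.

def group_pairwise(groups, f, tau_f, running):
    """Algorithm 2 (sketch): merge too-fine groups pairwise.

    groups  : task groups (lists of tasks), sorted by decreasing f(group)
    f       : callable returning the fineness degree f_i of a group (Eq. 2)
    tau_f   : fineness threshold
    running : number of running task groups (R)
    """
    queued = len(groups)                        # Q: number of waiting groups
    i = 0
    while i < len(groups) - 1:
        j = i + 1
        while f(groups[i]) > tau_f and queued > running and j < len(groups):
            if groups[j] and f(groups[j]) > tau_f:
                groups[i].extend(groups[j])     # merge group j into group i
                groups[j] = []                  # group j becomes empty
                queued -= 1
            j += 1
        i = j
    return [g for g in groups if g]             # delete empty task groups

def coarseness(Q, R):
    """eta_c (Eq. 3): fraction of running groups among active groups."""
    return R / (Q + R)

# Toy example where fineness decreases as groups get larger: pairwise
# merges stop once each group is coarse enough.
f = lambda g: 1.0 / len(g)
print(group_pairwise([[1], [2], [3], [4]], f, tau_f=0.6, running=1))
# [[1, 2], [3, 4]]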

3.2. Fairness Control Process

Algorithm 3 describes our fairness control process. Fairness is controlled by allocating resources to workflows according to their fraction of pending work. This is done by re-prioritising tasks in workflows where the unfairness degree η_u is greater than a threshold τ_u. This section describes how η_u and τ_u are computed, and details the re-prioritization algorithm.

Algorithm 3 Main loop for fairness control
1: input: m workflow executions
2: while there is an active workflow do
3:   wait for timeout or task status change in any workflow
4:   determine unfairness degree η_u
5:   if η_u > τ_u then
6:     re-prioritize tasks using Algorithm 4
7:   end if
8: end while

Measuring unfairness: η_u. Let m be the number of workflows with an active activity. The unfairness degree η_u is the maximum difference between the fractions of pending work:

η_u = W_max − W_min,    (4)

with W_min = min{W_i, i ∈ [1,m]} and W_max = max{W_i, i ∈ [1,m]}. All W_i are in [0, 1]. For η_u = 0, we consider that resources are fairly distributed among all workflows; otherwise, some workflows consume more resources than they should. The fraction of pending work W_i of a workflow i ∈ [1,m] is defined from the fraction of pending work w_{i,j} of its n_i active activities:

W_i = max_{j ∈ [1,n_i]} w_{i,j}.    (5)

All w_{i,j} are between 0 and 1. A high w_{i,j} value indicates that the activity has a lot of pending work compared to the others. We define w_{i,j} as:

w_{i,j} = Q_{i,j} / (Q_{i,j} + R_{i,j} · P_{i,j}) · T̂_{i,j},    (6)

where Q_{i,j} is the number of waiting tasks in the activity, R_{i,j} is the number of running tasks in the activity, P_{i,j} is the performance of the activity, and T̂_{i,j} is its relative observed duration. T̂_{i,j} is defined as the ratio between the median duration t̃_{i,j} of the completed tasks in activity j and the maximum median task duration among all active activities of all running workflows:

T̂_{i,j} = t̃_{i,j} / max_{v ∈ [1,m], w ∈ [1,n_v]} t̃_{v,w},    (7)

where t̃_{i,j} is computed as t̃_{i,j} = t̃_setup_{i,j} + t̃_input_{i,j} + t̃_exec_{i,j} + t̃_output_{i,j}. The medians are progressively estimated as tasks complete. At the beginning of the execution, T̂_{i,j} is initialized to 1 and all medians are undefined; when two tasks of activity j complete, t̃_{i,j} is updated and T̂_{i,j} is computed with equation 7. In this equation, the max operator is computed only on the n'_i ≤ n_i activities with at least 2 completed tasks, i.e. for which t̃_{i,j} can be determined.

In Eq. 6, the performance P_{i,j} of an activity varies between 0 and 1. A low P_{i,j} indicates that the resources allocated to the activity have bad performance for the activity; in this case, the contribution of running tasks is reduced and w_{i,j} increases. Conversely, a high P_{i,j} increases the contribution of running tasks, and therefore decreases w_{i,j}. For an activity j with k_j active tasks, we define P_{i,j} as:

P_{i,j} = 2 · (1 − max_{u ∈ [1,k_j]} { t_u / (t̃_{i,j} + t_u) }),    (8)

where t_u = t_setup_u + t_input_u + t_exec_u + t_output_u is the sum of the estimated durations of task u's phases. Estimated task phase durations are computed as the maximum between the current elapsed time in the task phase (0 if the task phase has not started) and the median duration of the task phase. P_{i,j} is initialized to 1, and updated using Eq. 8 only when at least 2 tasks of activity j are completed. If all tasks perform as the median, i.e. t_u = t̃_{i,j}, then max_{u ∈ [1,k_j]} { t_u / (t̃_{i,j} + t_u) } = 0.5 and P_{i,j} = 1. Conversely, if a task in the activity lasts significantly longer than the median, i.e. t_u ≫ t̃_{i,j}, then max_{u ∈ [1,k_j]} { t_u / (t̃_{i,j} + t_u) } → 1 and P_{i,j} → 0. This definition of P_{i,j} considers that bad performance leads to a few tasks blocking the activity. Indeed, we assume that the scheduler does not deliberately favor any activity and that performance discrepancies are manifested by a few unlucky tasks slowed down by bad resources. Performance, in this case, has a relative definition: depending on the activity profile, it can correspond to CPU, RAM, network bandwidth, latency, or a combination of those. We admit that this definition of P_{i,j} is somewhat rough. However, under our non-clairvoyance assumption, estimating the resource performance for the activity more accurately is hardly possible because (i) we have no model of the application, so task durations cannot be predicted from CPU, RAM or network characteristics, and (ii) network characteristics and even available RAM are shared among concurrent tasks running on the infrastructure, which makes them hardly measurable.
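The following compact Python sketch implements Eqs. (4)-(8); the data layout (dicts holding Q, R, the median duration, and per-task estimated durations) is our own illustration, not the actual MOTEUR/VIP data structures.

def unfairness_degree(workflows):
    """Unfairness degree eta_u (Eq. 4) over workflows with active activities.

    Each workflow is a list of activities; each activity is a dict with
    Q (waiting tasks), R (running tasks), t_med (median task duration,
    None while fewer than 2 tasks completed) and t_est (estimated
    durations t_u of its active tasks).
    """
    medians = [a["t_med"] for wf in workflows for a in wf if a["t_med"]]
    t_max = max(medians) if medians else None

    def perf(a):                                          # Eq. 8
        if a["t_med"] is None or not a["t_est"]:
            return 1.0
        return 2 * (1 - max(t / (a["t_med"] + t) for t in a["t_est"]))

    def pending(a):                                       # Eqs. 6 and 7
        t_rel = a["t_med"] / t_max if a["t_med"] and t_max else 1.0
        denom = a["Q"] + a["R"] * perf(a)
        return (a["Q"] / denom if denom else 1.0) * t_rel

    W = [max(pending(a) for a in wf) for wf in workflows]  # Eq. 5
    return max(W) - min(W)                                 # Eq. 4

# Two single-activity workflows with nominal performance: the second one
# has much more pending work, so re-prioritization would be triggered.
wf1 = [{"Q": 2, "R": 4, "t_med": 10.0, "t_est": [10.0] * 4}]
wf2 = [{"Q": 8, "R": 1, "t_med": 10.0, "t_est": [10.0]}]
print(unfairness_degree([wf1, wf2]) > 0.2)   # True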

Thresholding unfairness: τ_u. Task prioritisation is triggered when the unfairness degree is considered critical, i.e. η_u > τ_u. Similarly to the fineness degree, the unfairness degree η_u was computed after each event found in the data set. Figure 2 shows the histogram of these values, where only η_u > 0 values are represented. This histogram is clearly bi-modal, which is a good property since it reduces the influence of τ_u. From this histogram, we choose τ_u = 0.2, a threshold value that clearly separates the two modes. For η_u > 0.2, the execution is considered unfair and task prioritization is triggered. In this particular case, the two modes are so well separated that the choice of the threshold value is not critical: any value between 0.2 and 0.9 can safely be chosen, as it will have only a minor effect on mode classification.

Figure 2. Histogram of the unfairness degree η_u sampled in bins of 0.05.

Task prioritization. The action taken to cope with unfairness is to increase the priority of Δ_{i,j} waiting tasks for all activities j of workflow i where w_{i,j} − W_min > τ_u. Running tasks cannot be pre-empted. Task priority is an integer initialized to 1. Δ_{i,j} is determined so that w̃_{i,j} = W_min + τ_u, where w̃_{i,j} is the estimated value of w_{i,j} after Δ_{i,j} tasks are prioritized. We approximate w̃_{i,j} as:

w̃_{i,j} = (Q_{i,j} − Δ_{i,j}) / (Q_{i,j} + R_{i,j} · P_{i,j}) · T̂_{i,j},

which assumes that Δ_{i,j} tasks will move from status queued to running, and that the performance of the new resources will be maximal. It gives:

Δ_{i,j} = Q_{i,j} − ⌊(τ_u + W_min) · (Q_{i,j} + R_{i,j} · P_{i,j}) / T̂_{i,j}⌋,    (9)

where ⌊·⌋ rounds a decimal down to the nearest integer value.

Algorithm 4 describes our task re-prioritization. maxPriority is the maximal priority value in all workflows. The priority of Δ_{i,j} waiting tasks is set to maxPriority + 1 in all activities j of workflows i where w_{i,j} − W_min > τ_u. Note that this algorithm takes into account the scatter among the W_i although η_u ignores it (see Eq. 4). Indeed, tasks are re-prioritized in any workflow i for which W_i − W_min > τ_u.

Algorithm 4 Task re-prioritization
1: input: W_1 to W_m // fractions of pending work
2: maxPriority = max task priority in all workflows
3: for i = 1 to m do
4:   if W_i − W_min > τ_u then
5:     for j = 1 to a_i do
6:       // a_i is the number of active activities in workflow i
7:       if w_{i,j} − W_min > τ_u then
8:         compute Δ_{i,j} from equation 9
9:         for p = 1 to Δ_{i,j} do
10:          if ∃ waiting task q in activity j with priority ≤ maxPriority then
11:            q.priority = maxPriority + 1
12:          end if
13:        end for
14:      end if
15:    end for
16:  end if
17: end for

The method also accommodates online conditions. If a new workflow i is submitted, then R_{i,j} = 0 for all its activities and T̂_{i,j} is initialized to 1. This leads to W_max = W_i = 1, which increases η_u. If η_u goes beyond τ_u, then Δ_{i,j} tasks of activity j of workflow i have their priorities increased to restore fairness. Similarly, if new resources arrive, then the R_{i,j} increase and η_u is updated accordingly.

Table I shows a simple example of the interaction of the task granularity and fairness control loops. The example illustrates a scenario where the granularity control loop reduces fairness among executions, and the fairness control loop counterbalances this reduction.

4. EXPERIMENTS

The experiments presented hereafter evaluate the task granularity and fairness control processes and their interaction. In the first scenario, we evaluate fairness, fineness, and coarseness control independently, under stationary and non-stationary loads. In the second scenario, we activate both control processes to test the two hypotheses already mentioned in the introduction: H1: the granularity control loop reduces fairness among executions; H2: the fairness control loop avoids this reduction.

4.1. Experiment conditions and metrics

The target computing platform for these experiments is the biomed VO of EGI, where the traces used to determine τ_f and τ_u were acquired (see Section 3.1). To ensure resource limitation without flooding the production system, experiments are performed on only 3 sites of different countries. Tasks generated by MOTEUR are submitted using the DIRAC scheduler. We use MOTEUR, configured to resubmit failed tasks up to 5 times, and with the task replication mechanism described in [32] activated.

We use the DIRAC v6r6p2 instance provided by France-Grilles, with a first-come, first-served policy imposed by submitting workflows with decreasing priority values for the fairness control process experiment. Four real medical simulation workflows are considered: GATE [33], SimuBloch [34], FIELD-II [35], and PET-Sorteo [36]; their main characteristics are summarized in Table II.

Table II. Workflow characteristics (→ indicates task dependencies).

Workflow                   | #Tasks | CPU time                  | Input  | Output | t̃_shared
GATE (CPU-intensive)       | 100    | few minutes to one hour   | 115 MB | 40 MB  | -
SimuBloch (data-intensive) | 25     | few seconds               | 15 MB  | < 5 MB | 90%
FIELD-II (data-intensive)  | 122    | few seconds to 15 minutes | 208 MB | 40 KB  | 40-60%
PET-Sorteo (CPU-intensive) |        | minutes                   | 20 MB  | 50 MB  | 50-80%

Table I. Example of the interaction of the task granularity and fairness control loops.

Consider two identical workflows, each composed of one activity with 10 tasks initially split into 10 groups, and assume that the task input data are shared among all tasks (i.e. t̃_shared = t̃_input). Let t̃ = 10 and t̃_shared = 7 (in arbitrary time units), obtained from two completed task groups. At time t, we assume R = 2 and Q = 6 for both workflows. Workflow 1 has the granularity control active, while Workflow 2 does not. For Workflow 1, Eq. 1 gives a fineness degree η_f above τ_f = 0.55; as Q > R, the activity is considered too fine and task grouping is triggered. Groups with f_j > τ_f are grouped pairwise until η_f ≤ τ_f or Q ≤ R: groups 5 and 6, 7 and 8, and 9 and 10 are grouped into new groups 11, 12, and 13 in Workflow 1, while Workflow 2 remains with 6 task groups. At a later time t' > t, Workflow 2 has more tasks left to compute and the configuration is clearly unfair: Eq. 4 gives η_u = 0.5. As η_u > τ_u = 0.2, the platform is considered unfair and task re-prioritization is triggered: according to Eq. 9, Δ_{2,1} = Q_{2,1} − ⌊(τ_u + W_min) · (Q_{2,1} + R_{2,1} · P_{2,1}) / T̂_{2,1}⌋ = 3 tasks from Workflow 2 are prioritized. At a later time t'' > t', η_u = 0.02 < τ_u: the platform is considered fair and no action is performed.
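To make the re-prioritization step concrete, the sketch below applies Eq. (9) in Python; the numeric inputs are illustrative stand-ins in the spirit of the Table I example, not the exact tabulated values.

import math

def tasks_to_prioritize(Q, R, P, T_rel, w_min, tau_u=0.2):
    """Delta_{i,j}: number of waiting tasks to re-prioritize (Eq. 9)."""
    return Q - math.floor((tau_u + w_min) * (Q + R * P) / T_rel)

# An unfavored activity with 8 waiting and 2 running tasks, nominal
# performance and relative duration, competing against a workflow whose
# fraction of pending work is W_min = 0.3: three of its waiting tasks
# get priority maxPriority + 1.
print(tasks_to_prioritize(Q=8, R=2, P=1.0, T_rel=1.0, w_min=0.3))   # 3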

For all experiments, control and tested executions are launched simultaneously to ensure similar grid conditions, which may vary among repetitions because computing, storage, and network resources are shared with other users. To cover different grid conditions, experiments are repeated over a time period of 4 weeks.

Three different fairness metrics are used in the experiments. Firstly, the standard deviation of the makespan, written σ_m, is a straightforward metric which can be used when identical workflows are executed. Secondly, we define the unfairness μ as the area under the curve η_u during the execution:

μ = Σ_{i=2}^{M} η_u(t_i) · (t_i − t_{i−1}),

where M is the number of time samples until the makespan. This metric measures whether the fairness process can indeed minimize η_u. Finally, the slowdown s of a completed workflow execution is defined by [20]:

s = M_multi / M_own,

where M_multi is the makespan observed on the shared platform, and M_own is the estimated makespan if it were executed alone on the platform. In our conditions, M_own is estimated as:

M_own = max_{p ∈ Ω} Σ_{u ∈ p} t_u,

where Ω is the set of task paths in the workflow, and t_u is the measured duration of task u. This assumes that concurrent executions only impact task waiting time, not performance. For instance, network congestion or changes in performance distribution resulting from concurrent executions are ignored. Our third fairness metric is σ_s, the standard deviation of the slowdown.
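Both μ and s are straightforward to compute from an execution trace, as in the following Python sketch; the trace layout ((timestamp, η_u) samples, and task paths given as lists of measured durations) is assumed for illustration.

def unfairness_mu(samples):
    """mu: area under the eta_u curve; samples = [(t_i, eta_u(t_i)), ...]."""
    return sum(eta * (t - t_prev)
               for (t_prev, _), (t, eta) in zip(samples, samples[1:]))

def slowdown(makespan_multi, task_paths):
    """s = M_multi / M_own, with M_own the longest task path duration."""
    m_own = max(sum(path) for path in task_paths)
    return makespan_multi / m_own

# eta_u sampled at t = 0, 60, 120 s; a workflow with two task paths.
print(unfairness_mu([(0, 0.0), (60, 0.3), (120, 0.1)]))   # 0.3*60 + 0.1*60 = 24.0
print(slowdown(900.0, [[100, 200, 50], [300, 100]]))      # 900 / 400 = 2.25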
4.2. Task granularity control

The granularity control process was implemented as a plugin of the MOTEUR workflow manager, receiving notifications about task status changes and task phase durations. The plugin then uses these data to group and de-group tasks according to Algorithm 1, where the timeout value is set to 2 minutes. Two sets of experiments are conducted, under different load patterns.

Experiment 1 evaluates the fineness control process under stationary load. It consists of separate executions of SimuBloch, FIELD-II, and PET-Sorteo/emission. A workflow activity using our task grouping mechanism (Fineness) is compared to a control activity (No-Granularity). Resource contention on the execution sites is kept high and constant so that no de-grouping is required.

Experiment 2 evaluates the benefit of the de-grouping control process under non-stationary load. It uses the FIELD-II activity. An execution using both fineness and coarseness control (Fineness-Coarseness) is compared to an execution without coarseness control (Fineness) and to a control execution (No-Granularity). Executions are started under resource contention, but the contention is progressively reduced during the experiment. This is done by submitting a heavy workflow before the experiment starts, and canceling it when half of the experiment tasks are completed.

As no online task modification is possible in DIRAC, we implemented task grouping by canceling queued tasks and submitting the grouped tasks as a new task. For each grouped task resubmitted in the Fineness or Fineness-Coarseness executions, a task in the No-Granularity execution is randomly selected and resubmitted too, to ensure equal race conditions for resource allocation (first-come, first-served policy) and that each execution faces the same resubmission overhead.

Experiment 1 (stationary load). Figure 3 shows the makespan of the SimuBloch, FIELD-II, and PET-Sorteo/emission executions. Fineness yields a significant makespan reduction for all repetitions. Table III shows the makespan (M) values and the number of task groups.

The task grouping mechanism is not able to group all SimuBloch tasks in a single group because 2 tasks must be completed for the process to have enough information about the application (i.e. so that t̃_shared and t̃ can be computed). This is a constraint of our non-clairvoyant conditions, where task durations cannot be determined in advance. FIELD-II tasks are initially not grouped but, as the queuing time becomes important, tasks are considered too fine and grouped. PET-Sorteo/emission is an intermediate case where only a few tasks are grouped. Results show that the task grouping mechanism speeds up SimuBloch and FIELD-II executions up to a factor of 2.6, and PET-Sorteo/emission executions to a lesser extent.

Figure 3. Experiment 1: makespan for Fineness and No-Granularity executions for the 3 workflow activities under stationary load.

Table III. Experiment 1: makespan (M) and number of task groups for SimuBloch, FIELD-II and PET-Sorteo/emission executions for the 5 repetitions.

Experiment 2 (non-stationary load). Figure 4 shows the makespan (top) and the evolution of task groups (bottom). Makespan values are reported in Table IV. In the first three repetitions, resources appear progressively during the workflow executions. Fineness and Fineness-Coarseness speed up executions up to factors of 1.5 and 2.1. Since Fineness does not benefit from newly arrived resources, it achieves a lower speed-up due to parallelism loss. In the two last repetitions, the de-grouping process in Fineness-Coarseness allows it to reach performance similar to No-Granularity, while Fineness is penalized by its lack of adaptation: a slowdown of 20% is observed compared to No-Granularity. Table IV also shows the average queuing time values for Experiment 2. The linear correlation coefficient between the makespan and the average queuing time is 0.91, which indicates that the makespan evolution is indeed correlated to the evolution of the queuing time induced by the granularity control process.

Figure 4. Experiment 2: makespan (top) and evolution of task groups (bottom) for FIELD-II executions under non-stationary load (resources arrive during the experiment).

Table IV. Experiment 2: makespan (M) and average queuing time (q̄) for the FIELD-II workflow executions for the 5 repetitions.

Our task granularity control process works best under high resource contention, when the amount of available resources is stable or decreases over time (Experiment 1).

Coarseness control can cope with soft increases in the number of available resources (Experiment 2), but fast variations remain difficult to handle. In the worst-case scenario, tasks are first grouped due to resource limitation, and resources suddenly appear once all task groups are already running. In this case the de-grouping algorithm has no group to handle, and granularity control penalizes the execution. Task pre-emption should be added to the method to address this scenario. In addition, our method depends on the capability to extract enough accurate information from completed tasks to handle active tasks using median estimates. This may not be the case for activities which execute only a few tasks.

4.3. Fairness control

Fairness control was implemented as a MOTEUR plugin receiving notifications about task and workflow status changes. Each workflow plugin forwards task status changes and t̃_{i,j} values to a service centralizing information about all the active workflows. This service then re-prioritizes tasks according to Algorithms 3 and 4. As no online task modification is possible in DIRAC, we implemented task prioritization by canceling and resubmitting queued tasks with new priorities. This implementation decision adds an overhead to task executions. Therefore, the timeout value used in Algorithm 3 is set to 3 minutes.

Three experiments are conducted. Experiment 3 tests whether unfairness among identical workflows is properly addressed. It consists of three GATE workflows sequentially submitted, as users usually do in the platform. Experiment 4 tests if the performance of very short workflow executions is improved by the fairness mechanism. Its workflow set has three GATE workflows launched sequentially, followed by a SimuBloch workflow.

Experiment 5 tests whether unfairness among different workflows is detected and properly handled. Its workflow set consists of a GATE, a FIELD-II, a PET-Sorteo and a SimuBloch workflow launched sequentially. For each experiment, a workflow set using our fairness mechanism (Fairness, F) is compared to a control workflow set (No-Fairness, NF). Fairness and No-Fairness are launched simultaneously to ensure similar grid conditions. For each task priority increase in the Fairness workflow set, a task in the No-Fairness workflow set task queue is also prioritized to ensure equal race conditions for resource allocation. Experiment results are not influenced by the resubmission overhead since both Fairness and No-Fairness experience the same overhead.

Experiment 3 (identical workflows). Figure 5 shows the makespan, the unfairness degree η_u, the makespan standard deviation σ_m, the slowdown standard deviation σ_s, and the unfairness μ for the 4 repetitions. The differences among makespans and the unfairness degree values are significantly reduced in all repetitions of Fairness. Both Fairness and No-Fairness behave similarly until η_u reaches the threshold value τ_u = 0.2. Unfairness is then detected and the mechanism triggers task prioritization. Paradoxically, the first effect of task prioritization is a slight increase of η_u. Indeed, P_{i,j} and T̂_{i,j}, which are initialized to 1, start changing earlier in Fairness than in No-Fairness due to the availability of task duration values to compute t̃_{i,j}. Note that η_u reaches similar maximal values in both cases, but reaches them faster in Fairness. The fairness mechanism then manages to decrease η_u back under 0.2 much faster than it happens in No-Fairness when tasks progressively complete. Finally, slight increases of η_u are sporadically observed towards the end of the execution. This is due to task replications performed by MOTEUR: when new tasks are created, the fraction of pending work W_i increases, which has an effect on η_u. Quantitatively, the fairness mechanism reduces σ_m up to a factor of 15, σ_s up to a factor of 7, and μ by about a factor of 2.

Figure 5. Experiment 3 (identical workflows). Top: comparison of the makespans; middle: unfairness degree η_u; bottom: standard deviation of the makespan (σ_m), standard deviation of the slowdown (σ_s), and unfairness (μ).

Experiment 4 (very short execution). Figure 6 shows the makespan, the unfairness degree η_u, the unfairness μ, and the slowdown standard deviation. In all cases, the makespan of the very short SimuBloch executions is significantly reduced for Fairness. The evolution of η_u is coherent with Experiment 3: a common initialization phase followed by an anticipated growth and decrease for Fairness. Fairness reduces σ_s up to a factor of 5.9 and the unfairness up to a factor of 1.9.

Figure 6. Experiment 4 (very short execution). Top: comparison of the makespans; middle: unfairness degree η_u; bottom: unfairness (μ) and standard deviation of the slowdown (σ_s).

Table V shows the makespan (m), average wait time (w̄), and slowdown (s) values for the SimuBloch execution launched after the 3 GATE workflows. As this is a non-clairvoyant scenario where no information about task execution times and future task submissions is known, the fairness mechanism is not able to give higher priorities to SimuBloch tasks in advance. Despite that, the fairness mechanism speeds up SimuBloch executions up to a factor of 2.9, reduces the task average wait time up to a factor of 4.4, and reduces the slowdown up to a factor of 5.9.

Table V. Experiment 4: SimuBloch's makespan (m), average wait time (w̄), and slowdown (s).

Experiment 5 (different workflows). Figure 7 shows the slowdown, the unfairness degree, the unfairness μ, and the slowdown standard deviation σ_s for the 4 repetitions. Fairness slows down GATE while it speeds up all other workflows. This is because GATE is the longest workflow and the first to be submitted; in No-Fairness, it is favored by resource allocation to the detriment of the other workflows.

The evolution of η_u is similar to Experiments 3 and 4. σ_s is reduced up to a factor of 3.8 and the unfairness up to a factor of 1.9.

Figure 7. Experiment 5 (different workflows). Top: comparison of the slowdowns; middle: unfairness degree η_u; bottom: unfairness (μ) and standard deviation of the slowdown (σ_s).

In all 3 experiments, fairness optimization takes time to begin because the method needs to acquire information about the applications, which are totally unknown when a workflow is launched. We could think of reducing the duration of this information-collecting phase, e.g. by designing initialization strategies that maximize information discovery, but it could not be totally removed. Currently, the method works best for applications with a lot of short tasks, because the first few tasks can be used for initialization and the optimization can be exploited for the remaining tasks. The worst-case scenario is a configuration where the number of available resources stays constant and equal to the number of tasks in the first submitted workflow: in this case, no action could be taken until the first workflow completes, and the method would not do better than first-come, first-served. Pre-emption of running tasks should be considered to address that.

4.4. Interactions between task granularity and fairness control

In this section, we evaluate the interaction between both control processes in the same execution. For this experiment, we consider executions of a set of SimuBloch workflows under high resource contention. Two experiments are conducted. Experiment 6 tests whether the task granularity control process penalizes fairness among workflow executions (H1); Experiment 7 tests whether the fairness control process mitigates the unfairness created by the granularity control process (H2). For each experiment, a workflow set where one workflow uses the granularity control process (Granularity, G) is compared to a control workflow set (No-Granularity, NG). A workflow set consists of three SimuBloch workflows sequentially submitted. In the Granularity set, the first workflow has the granularity control process enabled, and the others do not. Experiment 6 includes a fairness service which only measures the unfairness among workflow executions, without triggering any action. In Experiment 7, task prioritization is triggered once unfairness is detected.


More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen

Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen Presenter: Haiying Shen Associate professor *Department of Electrical and Computer Engineering, Clemson University,

More information

Reduces latency and buffer overhead. Messaging occurs at a speed close to the processors being directly connected. Less error detection

Reduces latency and buffer overhead. Messaging occurs at a speed close to the processors being directly connected. Less error detection Switching Operational modes: Store-and-forward: Each switch receives an entire packet before it forwards it onto the next switch - useful in a general purpose network (I.e. a LAN). usually, there is a

More information

On Cluster Resource Allocation for Multiple Parallel Task Graphs

On Cluster Resource Allocation for Multiple Parallel Task Graphs On Cluster Resource Allocation for Multiple Parallel Task Graphs Henri Casanova, Frédéric Desprez, Frédéric Suter To cite this version: Henri Casanova, Frédéric Desprez, Frédéric Suter. On Cluster Resource

More information

Managing Performance Variance of Applications Using Storage I/O Control

Managing Performance Variance of Applications Using Storage I/O Control Performance Study Managing Performance Variance of Applications Using Storage I/O Control VMware vsphere 4.1 Application performance can be impacted when servers contend for I/O resources in a shared storage

More information

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou ( Zhejiang University

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou (  Zhejiang University Operating Systems (Fall/Winter 2018) CPU Scheduling Yajin Zhou (http://yajin.org) Zhejiang University Acknowledgement: some pages are based on the slides from Zhi Wang(fsu). Review Motivation to use threads

More information

Operating Systems Unit 3

Operating Systems Unit 3 Unit 3 CPU Scheduling Algorithms Structure 3.1 Introduction Objectives 3.2 Basic Concepts of Scheduling. CPU-I/O Burst Cycle. CPU Scheduler. Preemptive/non preemptive scheduling. Dispatcher Scheduling

More information

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University Frequently asked questions from the previous class survey CS 370: SYSTEM ARCHITECTURE & SOFTWARE [CPU SCHEDULING] Shrideep Pallickara Computer Science Colorado State University OpenMP compiler directives

More information

Enhancing Cloud Resource Utilisation using Statistical Analysis

Enhancing Cloud Resource Utilisation using Statistical Analysis Institute of Advanced Engineering and Science International Journal of Cloud Computing and Services Science (IJ-CLOSER) Vol.3, No.1, February 2014, pp. 1~25 ISSN: 2089-3337 1 Enhancing Cloud Resource Utilisation

More information

Two-Level Dynamic Load Balancing Algorithm Using Load Thresholds and Pairwise Immigration

Two-Level Dynamic Load Balancing Algorithm Using Load Thresholds and Pairwise Immigration Two-Level Dynamic Load Balancing Algorithm Using Load Thresholds and Pairwise Immigration Hojiev Sardor Qurbonboyevich Department of IT Convergence Engineering Kumoh National Institute of Technology, Daehak-ro

More information

vsan 6.6 Performance Improvements First Published On: Last Updated On:

vsan 6.6 Performance Improvements First Published On: Last Updated On: vsan 6.6 Performance Improvements First Published On: 07-24-2017 Last Updated On: 07-28-2017 1 Table of Contents 1. Overview 1.1.Executive Summary 1.2.Introduction 2. vsan Testing Configuration and Conditions

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

The Pennsylvania State University. The Graduate School. Department of Computer Science and Engineering

The Pennsylvania State University. The Graduate School. Department of Computer Science and Engineering The Pennsylvania State University The Graduate School Department of Computer Science and Engineering QUALITY OF SERVICE BENEFITS OF FLASH IN A STORAGE SYSTEM A Thesis in Computer Science and Engineering

More information

Quality of Service (QoS)

Quality of Service (QoS) Quality of Service (QoS) A note on the use of these ppt slides: We re making these slides freely available to all (faculty, students, readers). They re in PowerPoint form so you can add, modify, and delete

More information

Mohammad Hossein Manshaei 1393

Mohammad Hossein Manshaei 1393 Mohammad Hossein Manshaei manshaei@gmail.com 1393 Voice and Video over IP Slides derived from those available on the Web site of the book Computer Networking, by Kurose and Ross, PEARSON 2 Multimedia networking:

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Resource Guide Implementing QoS for WX/WXC Application Acceleration Platforms

Resource Guide Implementing QoS for WX/WXC Application Acceleration Platforms Resource Guide Implementing QoS for WX/WXC Application Acceleration Platforms Juniper Networks, Inc. 1194 North Mathilda Avenue Sunnyvale, CA 94089 USA 408 745 2000 or 888 JUNIPER www.juniper.net Table

More information

Scheduling. Monday, November 22, 2004

Scheduling. Monday, November 22, 2004 Scheduling Page 1 Scheduling Monday, November 22, 2004 11:22 AM The scheduling problem (Chapter 9) Decide which processes are allowed to run when. Optimize throughput, response time, etc. Subject to constraints

More information

GRID SIMULATION FOR DYNAMIC LOAD BALANCING

GRID SIMULATION FOR DYNAMIC LOAD BALANCING GRID SIMULATION FOR DYNAMIC LOAD BALANCING Kapil B. Morey 1, Prof. A. S. Kapse 2, Prof. Y. B. Jadhao 3 1 Research Scholar, Computer Engineering Dept., Padm. Dr. V. B. Kolte College of Engineering, Malkapur,

More information

Data Intensive processing with irods and the middleware CiGri for the Whisper project Xavier Briand

Data Intensive processing with irods and the middleware CiGri for the Whisper project Xavier Briand and the middleware CiGri for the Whisper project Use Case of Data-Intensive processing with irods Collaboration between: IT part of Whisper: Sofware development, computation () Platform Ciment: IT infrastructure

More information

Inital Starting Point Analysis for K-Means Clustering: A Case Study

Inital Starting Point Analysis for K-Means Clustering: A Case Study lemson University TigerPrints Publications School of omputing 3-26 Inital Starting Point Analysis for K-Means lustering: A ase Study Amy Apon lemson University, aapon@clemson.edu Frank Robinson Vanderbilt

More information

MODELING OF SMART GRID TRAFFICS USING NON- PREEMPTIVE PRIORITY QUEUES

MODELING OF SMART GRID TRAFFICS USING NON- PREEMPTIVE PRIORITY QUEUES MODELING OF SMART GRID TRAFFICS USING NON- PREEMPTIVE PRIORITY QUEUES Contents Smart Grid Model and Components. Future Smart grid components. Classification of Smart Grid Traffic. Brief explanation of

More information

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments WHITE PAPER Application Performance Management The Case for Adaptive Instrumentation in J2EE Environments Why Adaptive Instrumentation?... 3 Discovering Performance Problems... 3 The adaptive approach...

More information

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini Metaheuristic Development Methodology Fall 2009 Instructor: Dr. Masoud Yaghini Phases and Steps Phases and Steps Phase 1: Understanding Problem Step 1: State the Problem Step 2: Review of Existing Solution

More information

Adapting Mixed Workloads to Meet SLOs in Autonomic DBMSs

Adapting Mixed Workloads to Meet SLOs in Autonomic DBMSs Adapting Mixed Workloads to Meet SLOs in Autonomic DBMSs Baoning Niu, Patrick Martin, Wendy Powley School of Computing, Queen s University Kingston, Ontario, Canada, K7L 3N6 {niu martin wendy}@cs.queensu.ca

More information

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking

More information

QoE Congestion Management With Allot QualityProtector

QoE Congestion Management With Allot QualityProtector Solution Brief QoE Congestion Management With Allot QualityProtector 2017 Allot Communications Ltd. All rights reserved. Allot Communications, Sigma, NetEnforcer and the Allot logo are trademarks of Allot

More information

DCRoute: Speeding up Inter-Datacenter Traffic Allocation while Guaranteeing Deadlines

DCRoute: Speeding up Inter-Datacenter Traffic Allocation while Guaranteeing Deadlines DCRoute: Speeding up Inter-Datacenter Traffic Allocation while Guaranteeing Deadlines Mohammad Noormohammadpour, Cauligi S. Raghavendra Ming Hsieh Department of Electrical Engineering University of Southern

More information

Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization

Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization Wei Chen, Jia Rao*, and Xiaobo Zhou University of Colorado, Colorado Springs * University of Texas at Arlington Data Center

More information

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management WHITE PAPER: ENTERPRISE AVAILABILITY Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management White Paper: Enterprise Availability Introduction to Adaptive

More information

4/6/2011. Informally, scheduling is. Informally, scheduling is. More precisely, Periodic and Aperiodic. Periodic Task. Periodic Task (Contd.

4/6/2011. Informally, scheduling is. Informally, scheduling is. More precisely, Periodic and Aperiodic. Periodic Task. Periodic Task (Contd. So far in CS4271 Functionality analysis Modeling, Model Checking Timing Analysis Software level WCET analysis System level Scheduling methods Today! erformance Validation Systems CS 4271 Lecture 10 Abhik

More information

CLOUD WORKFLOW SCHEDULING BASED ON STANDARD DEVIATION OF PREDICTIVE RESOURCE AVAILABILITY

CLOUD WORKFLOW SCHEDULING BASED ON STANDARD DEVIATION OF PREDICTIVE RESOURCE AVAILABILITY DOI: http://dx.doi.org/10.26483/ijarcs.v8i7.4214 Volume 8, No. 7, July August 2017 International Journal of Advanced Research in Computer Science RESEARCH PAPER Available Online at www.ijarcs.info ISSN

More information

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY Proceedings of the 1998 Winter Simulation Conference D.J. Medeiros, E.F. Watson, J.S. Carson and M.S. Manivannan, eds. HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A

More information

The MOSIX Algorithms for Managing Cluster, Multi-Clusters, GPU Clusters and Clouds

The MOSIX Algorithms for Managing Cluster, Multi-Clusters, GPU Clusters and Clouds The MOSIX Algorithms for Managing Cluster, Multi-Clusters, GPU Clusters and Clouds Prof. Amnon Barak Department of Computer Science The Hebrew University of Jerusalem http:// www. MOSIX. Org 1 Background

More information

TCP and BBR. Geoff Huston APNIC

TCP and BBR. Geoff Huston APNIC TCP and BBR Geoff Huston APNIC Computer Networking is all about moving data The way in which data movement is controlled is a key characteristic of the network architecture The Internet protocol passed

More information

Optimizing the stretch of independent tasks on a cluster: From sequential tasks to moldable tasks

Optimizing the stretch of independent tasks on a cluster: From sequential tasks to moldable tasks Optimizing the stretch of independent tasks on a cluster: From sequential tasks to moldable tasks Erik Saule a,, Doruk Bozdağ a, Ümit V. Çatalyüreka,b a Department of Biomedical Informatics, The Ohio State

More information

An Improved Heft Algorithm Using Multi- Criterian Resource Factors

An Improved Heft Algorithm Using Multi- Criterian Resource Factors An Improved Heft Algorithm Using Multi- Criterian Resource Factors Renu Bala M Tech Scholar, Dept. Of CSE, Chandigarh Engineering College, Landran, Mohali, Punajb Gagandeep Singh Assistant Professor, Dept.

More information

Users and utilization of CERIT-SC infrastructure

Users and utilization of CERIT-SC infrastructure Users and utilization of CERIT-SC infrastructure Equipment CERIT-SC is an integral part of the national e-infrastructure operated by CESNET, and it leverages many of its services (e.g. management of user

More information

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0. IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

Technical Documentation Version 7.4. Performance

Technical Documentation Version 7.4. Performance Technical Documentation Version 7.4 These documents are copyrighted by the Regents of the University of Colorado. No part of this document may be reproduced, stored in a retrieval system, or transmitted

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Second International Barometer of Security in SMBs

Second International Barometer of Security in SMBs 1 2 Contents 1. Introduction. 3 2. Methodology.... 5 3. Details of the companies surveyed 6 4. Companies with security systems 10 5. Companies without security systems. 15 6. Infections and Internet threats.

More information

Multiprocessor and Real-Time Scheduling. Chapter 10

Multiprocessor and Real-Time Scheduling. Chapter 10 Multiprocessor and Real-Time Scheduling Chapter 10 1 Roadmap Multiprocessor Scheduling Real-Time Scheduling Linux Scheduling Unix SVR4 Scheduling Windows Scheduling Classifications of Multiprocessor Systems

More information

Improved Task Scheduling Algorithm in Cloud Environment

Improved Task Scheduling Algorithm in Cloud Environment Improved Task Scheduling Algorithm in Cloud Environment Sumit Arora M.Tech Student Lovely Professional University Phagwara, India Sami Anand Assistant Professor Lovely Professional University Phagwara,

More information

QUT Digital Repository:

QUT Digital Repository: QUT Digital Repository: http://eprints.qut.edu.au/ This is the accepted version of this conference paper. To be published as: Ai, Lifeng and Tang, Maolin and Fidge, Colin J. (2010) QoS-oriented sesource

More information

Performance Consequences of Partial RED Deployment

Performance Consequences of Partial RED Deployment Performance Consequences of Partial RED Deployment Brian Bowers and Nathan C. Burnett CS740 - Advanced Networks University of Wisconsin - Madison ABSTRACT The Internet is slowly adopting routers utilizing

More information

OPERATING SYSTEMS CS3502 Spring Processor Scheduling. Chapter 5

OPERATING SYSTEMS CS3502 Spring Processor Scheduling. Chapter 5 OPERATING SYSTEMS CS3502 Spring 2018 Processor Scheduling Chapter 5 Goals of Processor Scheduling Scheduling is the sharing of the CPU among the processes in the ready queue The critical activities are:

More information

Ricardo Rocha. Department of Computer Science Faculty of Sciences University of Porto

Ricardo Rocha. Department of Computer Science Faculty of Sciences University of Porto Ricardo Rocha Department of Computer Science Faculty of Sciences University of Porto Slides based on the book Operating System Concepts, 9th Edition, Abraham Silberschatz, Peter B. Galvin and Greg Gagne,

More information

Advanced Operating Systems (CS 202) Scheduling (2)

Advanced Operating Systems (CS 202) Scheduling (2) Advanced Operating Systems (CS 202) Scheduling (2) Lottery Scheduling 2 2 2 Problems with Traditional schedulers Priority systems are ad hoc: highest priority always wins Try to support fair share by adjusting

More information

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large

More information

White Paper Features and Benefits of Fujitsu All-Flash Arrays for Virtualization and Consolidation ETERNUS AF S2 series

White Paper Features and Benefits of Fujitsu All-Flash Arrays for Virtualization and Consolidation ETERNUS AF S2 series White Paper Features and Benefits of Fujitsu All-Flash Arrays for Virtualization and Consolidation Fujitsu All-Flash Arrays are extremely effective tools when virtualization is used for server consolidation.

More information

HOW SDN AND NFV EXTEND THE ADOPTION AND CAPABILITIES OF SELF-ORGANIZING NETWORKS (SON)

HOW SDN AND NFV EXTEND THE ADOPTION AND CAPABILITIES OF SELF-ORGANIZING NETWORKS (SON) WHITE PAPER HOW SDN AND NFV EXTEND THE ADOPTION AND CAPABILITIES OF SELF-ORGANIZING NETWORKS (SON) WHAT WILL YOU LEARN? n SON in the radio access network n Adoption of SON solutions typical use cases n

More information

Centralized versus distributed schedulers for multiple bag-of-task applications

Centralized versus distributed schedulers for multiple bag-of-task applications Centralized versus distributed schedulers for multiple bag-of-task applications O. Beaumont, L. Carter, J. Ferrante, A. Legrand, L. Marchal and Y. Robert Laboratoire LaBRI, CNRS Bordeaux, France Dept.

More information

Last Class: Processes

Last Class: Processes Last Class: Processes A process is the unit of execution. Processes are represented as Process Control Blocks in the OS PCBs contain process state, scheduling and memory management information, etc A process

More information

Voice, Video and Data Convergence:

Voice, Video and Data Convergence: : A best-practice approach for transitioning your network infrastructure White Paper The business benefits of network convergence are clear: fast, dependable, real-time communication, unprecedented information

More information

Scheduling on clusters and grids

Scheduling on clusters and grids Some basics on scheduling theory Grégory Mounié, Yves Robert et Denis Trystram ID-IMAG 6 mars 2006 Some basics on scheduling theory 1 Some basics on scheduling theory Notations and Definitions List scheduling

More information

Application of SDN: Load Balancing & Traffic Engineering

Application of SDN: Load Balancing & Traffic Engineering Application of SDN: Load Balancing & Traffic Engineering Outline 1 OpenFlow-Based Server Load Balancing Gone Wild Introduction OpenFlow Solution Partitioning the Client Traffic Transitioning With Connection

More information

Automated Control for Elastic Storage

Automated Control for Elastic Storage Automated Control for Elastic Storage Summarized by Matthew Jablonski George Mason University mjablons@gmu.edu October 26, 2015 Lim, H. C. and Babu, S. and Chase, J. S. (2010) Automated Control for Elastic

More information

CHAPTER 2 WIRELESS SENSOR NETWORKS AND NEED OF TOPOLOGY CONTROL

CHAPTER 2 WIRELESS SENSOR NETWORKS AND NEED OF TOPOLOGY CONTROL WIRELESS SENSOR NETWORKS AND NEED OF TOPOLOGY CONTROL 2.1 Topology Control in Wireless Sensor Networks Network topology control is about management of network topology to support network-wide requirement.

More information

A QoS Load Balancing Scheduling Algorithm in Cloud Environment

A QoS Load Balancing Scheduling Algorithm in Cloud Environment A QoS Load Balancing Scheduling Algorithm in Cloud Environment Sana J. Shaikh *1, Prof. S.B.Rathod #2 * Master in Computer Engineering, Computer Department, SAE, Pune University, Pune, India # Master in

More information

Worst-case Ethernet Network Latency for Shaped Sources

Worst-case Ethernet Network Latency for Shaped Sources Worst-case Ethernet Network Latency for Shaped Sources Max Azarov, SMSC 7th October 2005 Contents For 802.3 ResE study group 1 Worst-case latency theorem 1 1.1 Assumptions.............................

More information

Trace-Based Evaluation of Job Runtime and Queue Wait Time Predictions in Grids

Trace-Based Evaluation of Job Runtime and Queue Wait Time Predictions in Grids Trace-Based Evaluation of Job Runtime and Queue Wait Time Predictions in Grids Ozan Sonmez, Nezih Yigitbasi, Alexandru Iosup, Dick Epema Parallel and Distributed Systems Group (PDS) Department of Software

More information

Measures of Central Tendency

Measures of Central Tendency Measures of Central Tendency MATH 130, Elements of Statistics I J. Robert Buchanan Department of Mathematics Fall 2017 Introduction Measures of central tendency are designed to provide one number which

More information

PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads

PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads Ran Xu (Purdue), Subrata Mitra (Adobe Research), Jason Rahman (Facebook), Peter Bai (Purdue),

More information

RED behavior with different packet sizes

RED behavior with different packet sizes RED behavior with different packet sizes Stefaan De Cnodder, Omar Elloumi *, Kenny Pauwels Traffic and Routing Technologies project Alcatel Corporate Research Center, Francis Wellesplein, 1-18 Antwerp,

More information

Performance Modelling Lecture 12: PEPA Case Study: Rap Genius on Heroku

Performance Modelling Lecture 12: PEPA Case Study: Rap Genius on Heroku Performance Modelling Lecture 12: PEPA Case Study: Rap Genius on Heroku Jane Hillston & Dimitrios Milios School of Informatics The University of Edinburgh Scotland 2nd March 2017 Introduction As an example

More information

Grid Scheduling Architectures with Globus

Grid Scheduling Architectures with Globus Grid Scheduling Architectures with Workshop on Scheduling WS 07 Cetraro, Italy July 28, 2007 Ignacio Martin Llorente Distributed Systems Architecture Group Universidad Complutense de Madrid 1/38 Contents

More information

TCP and BBR. Geoff Huston APNIC

TCP and BBR. Geoff Huston APNIC TCP and BBR Geoff Huston APNIC Computer Networking is all about moving data The way in which data movement is controlled is a key characteristic of the network architecture The Internet protocol passed

More information

ThruPut Manager AE Product Overview From

ThruPut Manager AE Product Overview From Intro ThruPut Manager AE (Automation Edition) is the only batch software solution in its class. It optimizes and automates the total z/os JES2 batch workload, managing every job from submission to end

More information

DB2 is a complex system, with a major impact upon your processing environment. There are substantial performance and instrumentation changes in

DB2 is a complex system, with a major impact upon your processing environment. There are substantial performance and instrumentation changes in DB2 is a complex system, with a major impact upon your processing environment. There are substantial performance and instrumentation changes in versions 8 and 9. that must be used to measure, evaluate,

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI, CHEN TIANZHOU, SHI QINGSONG, JIANG NING College of Computer Science Zhejiang University College of Computer

More information

Chapter 5: Process Scheduling

Chapter 5: Process Scheduling Chapter 5: Process Scheduling Chapter 5: Process Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Thread Scheduling Operating Systems Examples Algorithm

More information