Scheduling Data- and Compute-intensive Applications in Hierarchical Distributed Systems


Matthias Röhm, Matthias Grabert and Franz Schweiggert
Institute of Applied Information Processing, Ulm University, Ulm, Germany

Abstract: The growing computerization in modern academic and industrial sectors is generating huge volumes of electronic data. Hierarchical distributed systems based on Grid and Cloud technologies promise to meet the tremendously rising resource requirements of heterogeneous, large-scale and distributed data mining applications. Scheduling plays a pivotal role in such environments. While most schedulers addressing these new challenges have a strong focus on compute-intensive applications, we introduce a new scheduling algorithm that supports both compute- and data-intensive applications in dynamic, heterogeneous, hierarchical environments. The developed data-aware scheduling algorithm aims to minimize the completion times of the applications as well as their costs, leading to an efficient utilization of all available resources. The algorithm is specifically designed for combined storage and compute resources, as these allow jobs to be executed on the resources storing the data sets and thus are the key to avoiding time-consuming and expensive data transfers. Simulations and first real-world usage experiences in the Fleet Data Acquisition Miner for analyzing the data generated by the Daimler fuel cell vehicle fleet show that the algorithm is suited for the different aspects of today's data analysis challenges.

Keywords: data-intensive; scheduling; Cloud; Grid.

I. INTRODUCTION

Increasing data volumes in many industrial and academic sectors are fueling the need for novel data analysis solutions to extract valuable information. Data mining, as the key methodology to address these information needs, requires effective and efficient resource management to transform the growing data into knowledge. There have been many efforts to provide specialized resource management solutions for complex data mining scenarios, including peer-to-peer data mining, distributed data stream mining and parallel data mining [1][2].

Recently, data mining research and development has put a focus on highly data-intensive applications. Google's publications on MapReduce [3][4] inspired many projects working on large data sets. MapReduce frameworks, like Hadoop [5], simplify the development and deployment of peta-scale data mining applications leveraging thousands of machines. MapReduce frameworks are highly scalable because the scheduler uses data location information to avoid data movement and instead sends the algorithms to the data.

Other current distributed data mining research is motivated by the sharing of heterogeneous, geographically distributed, dynamic resources from multiple administrative domains to support the cooperation of different organizations [6][7][8]. This field is generally referred to as data mining in Grid computing environments and is closely related to the Cloud computing paradigm. Most of this research has focused on compute-intensive applications, following scheduling principles that are correct for compute-intensive, but not for data-intensive data mining applications.
Different scheduling algorithms have been proposed to optimize the relation between data transfer and execution time [9][10][11]. For data-intensive applications, however, where the limiting factor is not CPU power but rather storage and network speed, the underlying architecture and environment assumptions of these algorithms may lead to non-optimal schedules. In addition, with the almost unlimited resources available through Cloud providers, the traditional scheduling concept in which jobs have to be assigned to a limited set of resources is no longer valid. Instead, the scheduler has to assign jobs to the resource(s) that best fit the needs of the job while considering the cost to execution time ratio.

As current data mining applications are both compute- and data-intensive, we developed an architecture based on the notion of combined compute and storage resources to bring the advantages of the MapReduce paradigm into worldwide, heterogeneous, general-purpose computing environments [12]. In this article we present a multi-objective scheduling algorithm for this dynamic, hierarchical Grid architecture incorporating the Cloud computing resource concepts.

This article is organized as follows: First, we briefly introduce the generalized architecture and its implications for data- and compute-intensive scheduling. Then we describe the developed multi-objective scheduling algorithm and compare it with existing Grid scheduling algorithms. Finally, we present the simulation results of the developed scheduling algorithm.

II. A GENERALIZED GRID SCHEDULING ARCHITECTURE

Grid scheduling algorithms are responsible for mapping application resource requests to available resources.

In comparison with other scheduling problems, the scheduler has to make its decisions under the following environmental characteristics:

- An application is represented by one or more jobs which might have dependencies. Each job consists of an executable with parameters, a set of input and output data specifications, and requirements.
- The execution time for a given job is not known in general.
- A job may require one or more compute resources.
- A data set may be stored on multiple storage resources.
- The Grid environment is dynamic and resources are heterogeneous.
- There is no reliable source of information.
- Jobs arrive at an unpredictable rate.

The main goal of the Grid scheduler is to produce a schedule for all jobs arriving over time that minimizes a given objective function (makespan, average completion time or cost) under these constraints. As this scheduling problem is related to problems that are known to be NP-complete, there is little chance that a polynomial algorithm exists to solve it [13]. Therefore Grid scheduling algorithms use the structure of the Grid environment to implement heuristics or approximations.

For compute-intensive application scenarios the main resource and the limiting factor is CPU power, and the focus of the Grid scheduler is to efficiently use the compute power of multiple compute clusters. In these scenarios it is commonly assumed that the time needed to transfer the input and output data is relatively small compared to the overall execution time. These assumptions lead to the following architecture, which forms the basis of current Grid schedulers:

(1) Specialized storage servers store input and output data as well as executables.
(2) A set of compute clusters from different organizations, each composed of multiple compute nodes, runs the algorithms. To provide a high level of transparency, these clusters are treated as one multi-CPU resource.
(3) In a traditional Grid scheduling setup, multiple clusters from different organizations are connected through relatively slow wide area networks, whereas the network bandwidth within an organization is assumed to be infinite.

A setup like this fits the needs of compute-intensive applications: To schedule a compute-intensive application requesting n computational resources, the scheduler only has to look for a cluster that has the best n free compute resources. As data transfer time is small compared to the execution time, the transfer overhead is sometimes neglected.

For most data-intensive applications this assumption does not hold. On the contrary, not CPU power but storage and network speed are the limiting factors of data-intensive applications. Data-intensive applications therefore require new scheduling strategies, as the input data transfer time may well exceed the execution time. Now the scheduler should choose compute and storage resources so that the overall time or cost, depending on the scheduling objective, is minimized. Obviously, a scheduler assuming infinite bandwidth within an organization may produce non-optimal schedules. Another aspect of traditional Grid schedulers is the assumption that there is only a very limited set of resources to which jobs have to be scheduled. But with the advances in Cloud computing, almost unlimited (compute) resources are available to the scheduler.
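To make the compute-only view described above concrete, the following minimal Python sketch (all function and field names are illustrative, not taken from any existing scheduler) picks the cluster offering the best n free compute nodes and ignores data placement entirely; this is exactly the decision rule that breaks down once data transfers dominate:

    # Illustrative sketch of a traditional compute-only scheduling decision
    # (hypothetical names; assumes data transfer overhead is negligible).
    def schedule_compute_intensive(clusters, n):
        """Pick the cluster whose n fastest free nodes are best, ignoring data location."""
        best_cluster, best_speed = None, -1.0
        for cluster in clusters:
            free_nodes = [node for node in cluster["nodes"] if node["free"]]
            if len(free_nodes) < n:
                continue  # this cluster cannot host the job at all
            # rank the free nodes by CPU speed and look at the n best ones
            top = sorted(free_nodes, key=lambda node: node["cpu_speed"], reverse=True)[:n]
            speed = sum(node["cpu_speed"] for node in top)
            if speed > best_speed:
                best_cluster, best_speed = cluster, speed
        return best_cluster

Nothing in this decision depends on where the input data resides, which is why such schedulers perform poorly for data-intensive jobs.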
We identified the need for some major conceptual enhancements to traditional Grid scheduling architectures to efficiently support compute- and data-intensive applications considering Cloud resources:

(1) As the amount of data increases, data cannot be efficiently stored and processed from a single storage server within a cluster but has to be distributed over multiple machines. Therefore any resource may provide storage and compute capacity. This combined resource type forms the basis for scalable data-intensive applications, as data can be processed directly at the storage location. To increase storage capacity and speed, computational resources of compute clusters may become combined resources by storing data on their local disks. It is important to point out that combined resources remain suitable for data- and compute-intensive applications, as only new functionality is added.

(2) The environment is hierarchical. Resources are mainly organized within a cluster; an organization may have multiple clusters on-site or in the Cloud; and multiple organizations may want to share their resources. This not only implies an administrative hierarchy, but the network connecting the different entities also has a hierarchical structure. Resources within a cluster are generally connected through a high-bandwidth, low-latency interconnect (e.g., InfiniBand), whereas inter-cluster or inter-organization network speed is typically much slower.

(3) Data-intensive applications are limited by two factors: storage and network speed. Scheduling algorithms should use these resources efficiently and avoid unnecessary input data transfers by processing the data directly on the resources storing the data or on resources nearby.

(4) There exist internal and external resources. There is only a limited number of internal resources, while any number of external resources may be added at an additional cost.

III. A MULTI-OBJECTIVE SCHEDULING ALGORITHM

A basic algorithm trying to minimize the average runtime of all jobs (J) for combined compute and storage resources in Grids was presented in [14]. With the increasing availability of Cloud resources, this basic scheduling has to be adapted to include the cost of the Cloud resources as well as the cost associated with the wide area networks needed to integrate these resources. Also, internal resources might be assigned costs to encourage a more efficient usage.

In addition to low job execution costs (K_j), the users are interested in getting their results as soon as possible, so that the completion time of the jobs (C_j) is another dimension of the scheduling problem. To balance the cost and the completion time of all jobs, the scheduler should minimize the following objective function:

    F(J) = Σ_{j ∈ J} C_j · K_j        (1)

As not all jobs are known to the scheduler in advance (offline scheduling) but rather appear over time (online scheduling), the minimum of F(J) can only be computed retrospectively. Therefore the scheduler can only schedule a subset of the jobs at a time and can only approximate the minimum of F(J).

Data-intensive jobs are defined as a tuple j = (p, D), where D = {d_1, ..., d_m} are the data sets to be processed by program p. The task of the scheduler is to select a tuple (r, S) for each job minimizing the objective function, where r is a compute resource and S = (s_1, ..., s_m) is the set of storage resources providing the m data subsets. It is assumed that all data subsets d ∈ D have to be available on the execution resource before they can be processed by p.

The algorithm presented in this paper uses the hierarchical structure of the environment and a cost to completion time ratio to decide which set of resources should be used for a job. This dual-objective hierarchical scheduling algorithm (DOHS) is depicted in Figure 1. As described above, the algorithm has to produce a schedule with little information about the current state of the system and the jobs. Therefore the algorithm was designed to require only relative storage and compute speeds of the resources as well as the corresponding relative costs. The inputs of the scheduling algorithm are the set of input data D, the program p, the data transfer scheduling weights α_1 to α_4 and the data to compute ratio weights β_1, β_2.

As all data subsets d ∈ D have to be available on one execution resource, the scheduler has to select a tuple (r, S) for each job, where r is a compute resource and S is the set of storage resources providing the data subsets. First, the algorithm produces the set of candidate execution resources as the resources with the best compute power to cost ratio (minimal f_c) from each cluster in the Grid plus all resources storing at least one of the data sets d ∈ D. For each of these candidate resources, the set of storage resources with minimal aggregated transfer overhead f_s with regard to D is generated. From all candidates the scheduler chooses the one with the highest priority. The priority of a resource assignment (r, S, t, c) is computed as the weighted sum of the normalized transfer overhead (t) and compute overhead (c).

The algorithm and the functions are based on the following definitions:

P := all programs available in the Grid;
R := {r_1, ..., r_n} is the set of all n resources;
D := {d_1, ..., d_m} is the set of m data sets of the job;
N := {(r, s) | r, s ∈ R, r and s can exchange data directly};
R_d := {r | r ∈ R stores d ∈ D};
D_r := {d | d ∈ D is stored on r ∈ R};
c_r is the cluster of resource r and g_r is the grid site of resource r;
sp_r is the storage speed of resource r and sc_r is its corresponding cost;
cp_r is the compute power and cc_r is the compute cost of resource r with respect to program p, where cp_r = 0 and cc_r = ∞ if r does not fulfill all requirements of p.

Different properties of a resource may be used to define the computing and storage power and cost of a resource, but at least the current usage has to be taken into account.
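For illustration, the definitions above might be captured in code roughly as follows. This is a minimal Python sketch with hypothetical names (Resource, Job, objective), not the data model of the actual implementation:

    from dataclasses import dataclass, field

    @dataclass
    class Resource:
        name: str
        cluster: str          # c_r: cluster of the resource
        grid_site: str        # g_r: grid site (organization) of the resource
        storage_speed: float  # sp_r
        storage_cost: float   # sc_r
        num_cpus: int         # nc_r
        reserved_cpus: int    # u_r: currently reserved CPUs
        cpu_speed: float      # cpu_r
        compute_cost: float   # cc_r
        data_sets: set = field(default_factory=set)  # data sets stored locally

    @dataclass
    class Job:
        program: str          # p
        data_sets: list       # D = {d_1, ..., d_m}

    def objective(completion_times, costs):
        """F(J) = sum over all jobs of C_j * K_j, cf. Eq. (1)."""
        return sum(c * k for c, k in zip(completion_times, costs))

A complete model would additionally track the set N of direct network links between resources; the sketch omits it for brevity.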
The data transfer overhead of a candidate resource tuple (r, S) is computed as the sum of the transfer overheads f_s(r, s, d) over all data sets d ∈ D. Due to the incomplete environment information, especially missing or imprecise network information, the scheduler uses the weights α_1 to α_4 that represent the hierarchical structure. The data transfer overhead function f_s assigns the weights α_1 to α_4 to the base transfer overhead of s, derived from the size of d and the storage speed and cost of s, according to the distance between s and r: α_1 if d is stored on the resource itself (r = s); α_2 if d is stored on a resource in the same cluster (c_r = c_s); α_3 if d is on the same Grid instance (g_r = g_s); and α_4 if d is on another Grid instance (g_r ≠ g_s). In case both resources are not able to exchange data directly, each resource needed to transfer the data set d from s to r is also considered.

FUNCTION f_s (r ∈ R, s ∈ R, d ∈ D+)
    Choose the shortest path s_0, s_1, ..., s_k from s to r with s_0 = s so that (s_0, s_1), ..., (s_k, r) ∈ N
    t ← size(d) · Σ_{i=0}^{k} sc_{s_i} / sp_{s_i}
    if r = s then t ← α_1 · t
    else if c_r = c_s then t ← α_2 · t
    else if g_r = g_s then t ← α_3 · t
    else t ← α_4 · t
    end if
    return t
END FUNCTION

As can easily be seen, the data transfer scheduling weights α_1 to α_4 may be chosen to approximate the actual network bandwidth topology and cost, or can be used to minimize inter-cluster or inter-organization transfers.

FUNCTION f_c (p ∈ P, r ∈ R)
    if r does not fulfill all requirements of p then return ∞ end if
    nc_r ← number of CPUs of r
    u_r ← currently reserved CPUs of r
    cpu_r ← CPU speed of r
    cpu_min ← min{ cpu_o | o ∈ R, nc_o > u_o }
    if nc_r > u_r then
        cp_r ← cpu_r
    else if nc_r ≤ u_r and ∃ o ∈ R with nc_o > u_o then
        cp_r ← (nc_r · cpu_min) / (nc_r + u_r)
    else
        cp_r ← (nc_r · cpu_r) / (nc_r + u_r)
    end if
    return cc_r / cp_r
END FUNCTION
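To make the two overhead functions more concrete, the following Python sketch implements them under simplifying assumptions: all resources can exchange data directly (so the path in f_s consists of s alone), and resource objects expose the attributes from the data-model sketch above. It illustrates the weighting scheme and is not the authors' implementation; the function and parameter names are ours.

    ALPHA = {"local": 1.0, "cluster": 1.5, "site": 5.0, "remote": 50.0}  # example weights

    def f_s(r, s, d_size, alpha=ALPHA):
        """Transfer overhead for fetching one data set of size d_size from s to r.

        Assumes r and s can exchange data directly; a full implementation would
        sum sc/sp over every hop of the shortest path from s to r.
        """
        base = d_size * (s.storage_cost / s.storage_speed)
        if r is s:
            return alpha["local"] * base
        if r.cluster == s.cluster:
            return alpha["cluster"] * base
        if r.grid_site == s.grid_site:
            return alpha["site"] * base
        return alpha["remote"] * base

    def f_c(p_requirements, r, all_resources):
        """Compute cost to compute power ratio of r; float('inf') if r cannot run p."""
        if not p_requirements(r):
            return float("inf")
        free_elsewhere = [o.cpu_speed for o in all_resources if o.num_cpus > o.reserved_cpus]
        if r.num_cpus > r.reserved_cpus:          # at least one free CPU on r
            cp = r.cpu_speed
        elif free_elsewhere:                       # r is busy but the Grid still has free CPUs
            cp = (r.num_cpus * min(free_elsewhere)) / (r.num_cpus + r.reserved_cpus)
        else:                                      # the whole Grid is busy
            cp = (r.num_cpus * r.cpu_speed) / (r.num_cpus + r.reserved_cpus)
        return r.compute_cost / cp

With the example α values taken from the simulation setup described later (1, 1.5, 5, 50), fetching a data set from another organization is penalized fifty times more heavily than reading it locally, which is what steers jobs toward the resources already storing their data.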

FUNCTION DOHS (p ∈ P, R, D ⊆ D+, α_1..α_4, β_1, β_2)
    R_m ← { r̂ | r̂ ∈ R, f_c(p, r̂) = min{ f_c(p, r) | r ∈ R, c_r̂ = c_r } }
    R_s ← { r | r ∈ R_D, f_c(p, r) < ∞ }
    Z ← ∅
    for all r ∈ R_m ∪ R_s do
        t_r ← 0, S_r ← ∅
        for all d ∈ D do
            Find ŝ ∈ R_d with f_s(r, ŝ, d) = min{ f_s(r, s, d) | s ∈ R_d }
            t_r ← t_r + f_s(r, ŝ, d)
            S_r ← S_r ∪ {ŝ}
        end for
        Z ← Z ∪ {(r, S_r, t_r, f_c(p, r))}
    end for
    t_min ← min{ t | (r, S, t, c) ∈ Z }
    c_min ← min{ c | (r, S, t, c) ∈ Z }
    Find (r̂, Ŝ, t̂, ĉ) ∈ Z with β_1 · t_min/t̂ + β_2 · c_min/ĉ = max{ β_1 · t_min/t + β_2 · c_min/c | (r, S, t, c) ∈ Z }
    return (r̂, Ŝ, t̂, ĉ)
END FUNCTION

Figure 1. DOHS algorithm for scheduling a program with multiple data sets.

The compute overhead function f_c returns the compute cost to compute power ratio of the resource with respect to program p. If r does not fulfill all requirements of p, the function returns ∞. In case there is at least one free CPU, the compute power is simply the CPU speed of the resource. If all CPUs are used but another resource in the Grid has a free CPU, the compute power is defined as the number of CPUs times the CPU speed of the slowest available resource, divided by the number of CPUs plus the reserved CPUs. Using the speed of the slowest available resource ensures that the compute power of a busy resource is never higher than the compute power of a resource with free CPUs. If there is no free CPU in the Grid, the compute power is defined using the resource's own CPU speed.

IV. SIMULATION RESULTS

To evaluate the presented algorithm we developed a simulation environment for executing data- and compute-intensive jobs in Grids with additional Cloud resources. In the first step of a simulation, a random number of resources (> 100), clusters and organizations are created. Compute, storage and network speeds of each resource are randomly generated as well. Also the number of jobs, the number of computations per MB and the number of data sets per job are generated randomly. In the next step, different scheduling algorithms are used to schedule the jobs as they arrive over time in the created Grid environment. For each algorithm the resulting schedule is simulated and the exact cost and time consumption of each job is computed. Based on the consumed cost and time, F(J) is calculated for each algorithm. Following the Monte Carlo simulation approach, 20 of these simulations were conducted and the Σ_{j ∈ J} C_j · K_j of each was recorded.

As a benchmark we use a brute-force algorithm evaluating all possible resource combinations for each job, minimizing different objective functions. As this requires n · s^m objective function evaluations (with n compute resources, on average m data sets per job, and each data set stored on average on s resources), it is not feasible to use this algorithm for real-world scheduling. But as a benchmark, it ensures that the (locally) optimal resources are chosen according to the objective function. In contrast to the developed algorithm, the benchmark algorithm is also provided with all environment and job information, including exact network bandwidth and the jobs' execution times.
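As a rough illustration of why the brute-force benchmark is infeasible in practice, the sketch below (plain Python, hypothetical names) enumerates every combination of one compute resource and one storage replica per data set, which is exactly the n · s^m candidate assignments mentioned above:

    import itertools

    def brute_force_schedule(job_data_sets, compute_resources, replicas, evaluate):
        """Evaluate every (r, S) assignment and return the one minimizing `evaluate`.

        replicas maps each data set to the storage resources holding a copy of it;
        with n compute resources and s replicas per data set this loop runs
        n * s**len(job_data_sets) times, hence it only serves as a benchmark.
        """
        best_assignment, best_value = None, float("inf")
        storage_choices = [replicas[d] for d in job_data_sets]
        for r in compute_resources:
            for S in itertools.product(*storage_choices):
                value = evaluate(r, S)   # e.g., cost * completion time of the job
                if value < best_value:
                    best_assignment, best_value = (r, S), value
        return best_assignment

The evaluate callback stands in for whichever objective function the benchmark is configured with.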

The benchmark algorithm is used to compute a schedule based on the following objective functions that are commonly used for Grid scheduling:

- Cost-Time (CT): Minimize the product of the job's cost and completion time.
- Time (T): Minimize the completion time of the job.
- Cost (C): Minimize the cost of the job.
- Transfer (TR): Choose the execution resource with the minimum transfer overhead.
- Time-Grid (TG): Minimize the completion time of the job based on the assumption that transfer overhead and cost are zero within an organization.

As shown in Figure 2, the developed DOHS scheduling algorithm provides good performance compared to the benchmark algorithm using the different objective functions. The DOHS algorithm achieves almost the same performance as the benchmark algorithm using the Cost-Time objective function and all information. The Time-Grid objective function may be regarded as a representative of current Grid schedulers that assume that transfer overhead and cost are zero within an organization. It not only requires more execution time but also incurs higher costs than the Time objective function, especially for data-intensive applications. Only the Transfer objective function, which can be seen as a generalization of the MapReduce scheduling approach of minimizing the transfer overhead, is worse.

We evaluated the algorithms for a mix of compute- and data-intensive jobs. Figure 2(a) shows two different values for the DOHS algorithm. DOHS-1 represents a scenario where users classify the jobs as compute-intensive (β_1 = 0.9, β_2 = 1) or data-intensive (β_1 = 1, β_2 = 0.9). DOHS-2 shows the results for a scenario where users do not provide any classification (β_1 = 1, β_2 = 1). As the DOHS algorithm provides parameters to adapt to different resource environments and job characteristics, it was configured with the following parameters: α_1 = 1, α_2 = 1.5, α_3 = 5 and α_4 = 50.

Figure 2. Scheduling simulation results: (a) Σ_{j ∈ J} C_j · K_j for the CT, DOHS-1, DOHS-2, T, C, TG and TR schedules; (b) total cost Σ_{j ∈ J} K_j versus total completion time Σ_{j ∈ J} C_j.

V. RELATED WORK

Recently, various systems and approaches to grid-based data mining and data-intensive scheduling have been reported in the literature. Some of those that are particularly relevant to this work are briefly reviewed here.

The GridBus resource broker [15] provides functions for scheduling data- and compute-intensive applications. In combination with the Storage Resource Broker [16], GridBus is able to schedule data-intensive jobs based on various metrics, including network bandwidth and utilization. The GridBus scheduler, like most heuristic Grid schedulers, including the DIANA scheduler [9], follows the common separation between storage and compute resources, requires detailed information about the jobs and the environment, and also assumes that the transfer overhead within an organization is zero.

Another class of Grid schedulers uses genetic algorithms to solve the data-intensive scheduling problem [17][18]. The main disadvantages of these approaches are the computational complexity of the genetic algorithm and the requirement to have detailed information about the environment.

Hadoop [5] is the most well-known open source implementation of Google's MapReduce paradigm. Hadoop's MapReduce framework is built on top of the Hadoop distributed file system (HDFS) containing all data to be mined. The map and reduce functions are typically written in Java, but executables can also be integrated via a streaming mechanism.
MapReduce frameworks like Hadoop do not offer the functionality to efficiently execute compute-intensive applications on a cluster, making them unsuitable for a general-purpose data mining system.

Hadoop On Demand in combination with the Sun Grid Engine tries to overcome these limitations by running Hadoop on top of a cluster management system, thus adding another layer of complexity. Still, the resources to be used for MapReduce are reserved exclusively for Hadoop and cannot be used by other compute-intensive jobs. Hadoop and similar MapReduce frameworks simplify the development and deployment of data-intensive applications on local clusters and cloud resources, but are currently not suited for large-scale, heterogeneous environments comprised of multiple independent organizations.

Anteater [1] is a web-service-based system to handle large data sets and high computational loads. Anteater applications have to be implemented in a filter-stream structure. This processing concept and its capability to distribute fine-grained parallel tasks make it a highly scalable system. Due to the restriction to a filter-stream structure, Anteater shares some downsides of MapReduce frameworks: applications have to be ported to the platform, which makes it almost impossible to integrate existing applications.

VI. CONCLUSION

In this article we introduced a multi-objective scheduling algorithm for data-intensive applications in Grid environments. The new concept of combined Grid resources in combination with the developed data-location-aware scheduling algorithm provides an infrastructure to build scalable data-intensive applications in worldwide, heterogeneous environments. The scheduling algorithm also supports compute-intensive applications, so that a single environment can be used for both data- and compute-intensive applications. In addition, the DOHS algorithm is specifically designed for Grid environments with Cloud resources, where information is generally scarce. The simulation results show that the algorithm is competitive with, or even surpasses, current Grid schedulers that require detailed information. Future work may focus on additional Cloud-related topics such as the setup time of a resource or dynamic cost.

REFERENCES

[1] D. Guedes, W. Meira, and R. Ferreira, "Anteater: A service-oriented architecture for high-performance data mining," IEEE Internet Computing, vol. 10, no. 4.
[2] S. Datta, K. Bhaduri, C. Giannella, and H. Kargupta, "Distributed data mining in peer-to-peer networks," IEEE Internet Computing, vol. 10, no. 4.
[3] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," 2004.
[4] S. Ghemawat, H. Gobioff, and S. T. Leung, "The Google file system," SIGOPS Oper. Syst. Rev., vol. 37, no. 5.
[5] T. White, Hadoop: The Definitive Guide, 1st ed., O'Reilly Media.
[6] V. Stankovski, M. Swain, V. Kravtsov, T. Niessen, D. Wegener, M. Röhm, J. Trnkoczy, M. May, J. Franke, A. Schuster, and W. Dubitzky, "Digging deep into the data mine with DataMiningGrid," IEEE Internet Computing, vol. 12, no. 6.
[7] A. Congiusta, D. Talia, and P. Trunfio, "Distributed data mining services leveraging WSRF," Future Generation Computer Systems, vol. 23, no. 1.
[8] B. Peter and W. Alexander, "Grid-aware approach to data statistics, data understanding and data preprocessing," International Journal of High Performance Computing and Networking, vol. 1, no. 6.
[9] R. McClatchey, A. Anjum, H. Stockinger, A. Ali, I. Willers, and M. Thomas, "Data intensive and network aware (DIANA) grid scheduling," Journal of Grid Computing, vol. 5.
[10] S. Venugopal and R. Buyya, "A set coverage-based mapping heuristic for scheduling distributed data-intensive applications on global grids," in Proceedings of the 7th IEEE/ACM International Conference on Grid Computing (Grid 2006), IEEE CS Press.
[11] K. Ranganathan and I. Foster, "Decoupling computation and data scheduling in distributed data-intensive applications," in HPDC '02: Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, IEEE Computer Society, 2002.
[12] M. Röhm, M. Grabert, and F. Schweiggert, "A generalized MapReduce approach for efficient mining of large data sets in the Grid," in 1st International Conference on Cloud Computing, GRIDs, and Virtualization (CLOUD COMPUTING 2010), Lisbon, Portugal, 2010.
[13] S. G. Akl and F. Dong, "Scheduling algorithms for grid computing: State of the art and open problems," School of Computing, Queen's University, Kingston, Ontario, Canada, Tech. Rep.
[14] M. Röhm, M. Grabert, and F. Schweiggert, "An integrated approach for data- and compute-intensive mining of large data sets in the grid," International Journal On Advances in Intelligent Systems.
[15] S. Venugopal, R. Buyya, and L. Winton, "A grid service broker for scheduling e-science applications on global data grids," Concurrency and Computation: Practice and Experience, vol. 18.
[16] A. Rajasekar, M. Wan, and R. Moore, "MySRB & SRB: Components of a data grid," in HPDC, 2002.
[17] T. Phan, K. Ranganathan, and R. Sion, "Evolving toward the perfect schedule: Co-scheduling job assignments and data replication in wide-area systems using a genetic algorithm," in JSSPP, 2005.
[18] A. K. M. K. A. Talukder, M. Kirley, and R. Buyya, "Multiobjective differential evolution for scheduling workflow applications on global grids," Concurr. Comput.: Pract. Exper., vol. 21, no. 13.


Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Table of Contents Introduction... 3 Topology Awareness in Hadoop... 3 Virtual Hadoop... 4 HVE Solution... 5 Architecture...

More information

Remote Direct Storage Management for Exa-Scale Storage

Remote Direct Storage Management for Exa-Scale Storage , pp.15-20 http://dx.doi.org/10.14257/astl.2016.139.04 Remote Direct Storage Management for Exa-Scale Storage Dong-Oh Kim, Myung-Hoon Cha, Hong-Yeon Kim Storage System Research Team, High Performance Computing

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Grid Computing Systems: A Survey and Taxonomy

Grid Computing Systems: A Survey and Taxonomy Grid Computing Systems: A Survey and Taxonomy Material for this lecture from: A Survey and Taxonomy of Resource Management Systems for Grid Computing Systems, K. Krauter, R. Buyya, M. Maheswaran, CS Technical

More information

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP ISSN: 0976-2876 (Print) ISSN: 2250-0138 (Online) PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP T. S. NISHA a1 AND K. SATYANARAYAN REDDY b a Department of CSE, Cambridge

More information

Multi-Criteria Strategy for Job Scheduling and Resource Load Balancing in Cloud Computing Environment

Multi-Criteria Strategy for Job Scheduling and Resource Load Balancing in Cloud Computing Environment Indian Journal of Science and Technology, Vol 8(30), DOI: 0.7485/ijst/205/v8i30/85923, November 205 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Multi-Criteria Strategy for Job Scheduling and Resource

More information

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems International Journal of Information and Education Technology, Vol., No. 5, December A Level-wise Priority Based Task Scheduling for Heterogeneous Systems R. Eswari and S. Nickolas, Member IACSIT Abstract

More information

An improved MapReduce Design of Kmeans for clustering very large datasets

An improved MapReduce Design of Kmeans for clustering very large datasets An improved MapReduce Design of Kmeans for clustering very large datasets Amira Boukhdhir Laboratoire SOlE Higher Institute of management Tunis Tunis, Tunisia Boukhdhir _ amira@yahoo.fr Oussama Lachiheb

More information

Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo

Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google 2017 fall DIP Heerak lim, Donghun Koo 1 Agenda Introduction Design overview Systems interactions Master operation Fault tolerance

More information

Mapping a group of jobs in the error recovery of the Grid-based workflow within SLA context

Mapping a group of jobs in the error recovery of the Grid-based workflow within SLA context Mapping a group of jobs in the error recovery of the Grid-based workflow within SLA context Dang Minh Quan International University in Germany School of Information Technology Bruchsal 76646, Germany quandm@upb.de

More information