A Distributed System for Genetic Linkage Analysis. Mark Silberstein

A Distributed System for Genetic Linkage Analysis

Research Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Mark Silberstein

Submitted to the Senate of the Technion - Israel Institute of Technology

Haifa, Israel, June 2010

Acknowledgments

First and foremost, I thank my advisers Prof. Dan Geiger and Prof. Assaf Schuster for their patient and thoughtful guidance. Not only did they find the optimal balance between tight supervision and open-minded free flight, but they also fostered confidence in my own ideas and taught me to see the wood for the trees. I express my deep gratitude to Prof. Miron Livny for his invaluable advice and support. His comments were instrumental in shaping my research on distributed systems, and his efforts enabled the use of real large-scale systems such as the UW Condor pool and the Open Science Grid, which have now become the core computing power of Superlink-online. I also thank Prof. John Owens from UC Davis, who introduced me to the wonderful world of Graphical Processing Units. The internship at UC Davis was a key turning point in my research, and without John's help it would not have been possible. Prof. Satoshi Matsuoka from the Tokyo Institute of Technology provided invaluable help and support in organizing and sponsoring my stay in Tokyo to work on the TSUBAME supercomputer. Throughout my PhD I have had the privilege to work with many talented people, both in Israel and abroad. I am truly grateful to Anjul Patney, Dan Bradley and Naoya Maruyama for their significant contributions and help. The Superlink-online system has been the result of team work. Anna Tzemach, Edward Vitkin, Andrei Anisenia, Irena Kruchkovsky and Oren Shtark contributed a lot to the first and the second versions of the system. It would have been impossible without Artyom Sharov, who not only implemented almost all of the GridBot system, but also shaped the way it works now.

I would like to thank our colleagues Ohad Birk, Helge Boman, Tzipi Falik, Jacek Majewski, Rivki Ophir, Alejandro Schäffer and Eric Seboun for providing us with data that enabled realistic evaluation of our system, and Alejandro Schäffer for valuable user feedback. I would also like to thank my good friend Gabi Kliot for his invaluable advice and encouragement along the way. But above all, I want to thank my beloved wife Natalia, my two sons Danny and Benny, who were born during my graduate studies, and my parents Alla and Boris, for their silent patience and understanding, for their help with the kids, for forgiving me for all the time I did not spend with them, and for their endless love and support, without which I would not have been able to achieve even a fraction of what has been achieved. The generous support of the Technion, NIH, Microsoft, the Israel Science Foundation, the SciDAC grant, the GSIC grant, and the Fein, Jacobs and Gutwirth Funds is gratefully acknowledged.

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
2 Preliminaries
    2.1 Exact parametric linkage analysis
    2.2 Bayesian networks
    Grid computing
        Grid characteristics
    GPU programming and CUDA
        High-level programming environment
        Direct Compute Access
3 Parallelization of linkage analysis for grids
    Parallelization of multipoint linkage analysis
        Parallelization of Phase II
        3.1.2 Parallelization of Phase III
    Execution on grid resources
        Preventing overload via staged complexity evaluation
        Minimizing response time via multiple queues
        Increasing throughput via multiple pools of computers
        Achieving high reliability via constant monitoring
    Results
4 Scheduling Mixed Workloads in Multi-grids
    Related work
    Model
        Platform model
        Application model
        Submission model
    Grid-hierarchy scheduling algorithm
        Grid execution hierarchy
        Scheduling jobs in a grid hierarchy
        Handling multiple grids at the same level of the execution hierarchy
    The application
    Deployment of Superlink-online
    Results
        Utilization of the execution hierarchy
        Overhead distribution in queues
        Distribution of jobs in levels
        Level of parallelism and volatility
    Discussion
5 Policy-based Scheduling of BOTs on Multiple Grids
    Related work
    Terminology
    5.3 GridBot architecture
        Work-dispatch logic
        Classads in GridBot
        Policy-driven work-dispatch algorithm
        Grid overlay
    Implementation
        BOINC
        Integrating work-dispatch policies
        Tail phase detection
        Scalability optimizations
        Execution clients
    Results
    Conclusions
6 Computation of sum-products on GPUs through software-managed cache
    Related work
    Background
        Sum-product
        Serial MPF kernel
    Cache-aware performance model
    User-managed cache
        Cache design
    Cache-aware parallel kernel
    Implementation
        CPU module
    Results
        Experimental setup
        GPU versus CPU performance
        Cache performance
    Conclusions and future work
7 Discussion
    Improving Bayesian inference computations on GPUs and multicores
        Hybrid of search and variable elimination
        Memory efficient task dependency tree traversal
        Power and time-efficient acceleration of task dependency trees
    Cost-efficient use of pay-as-you-use grids
    Incorporating GPUs into GridBot
Bibliography
Hebrew introduction

List of Tables

3.1 Resource allocation for the analysis of the pedigree in Figure
Summary of experiments on large pedigrees
Results of runs in multiple scheduling scenarios
Configuration parameters for different queues
Resource and job properties in different queues
Aggregate statistics per grid for the high throughput
Influence of replication policy in collaborative grids

List of Figures

2.1 A Bayesian network for three-point analysis
Performance characteristics of real grids
Procedure Find-Order(N, T, L) from [53]
Procedure SG(N, T, E) from [53]
Procedure Constrained-Elimination(N, Π) from [53]
Algorithm Parallel-Find-Order
Algorithm Parallel-Constrained-Elimination
Pedigree reprinted from [72]
Pedigree reprinted from [114]
Pedigree with 266 individuals
Expected distribution of jobs runtimes on a single CPU
Procedure ProcessNewJob
Procedure EnforceQueueLimits
Superlink-online deployment
Job distribution between the hierarchy levels
Average accumulated time of jobs in each queue
Overhead distribution in different queues
Actual distribution of jobs among different levels of hierarchy
GridBot high level architecture
Example of a typical GridBot classad
Scheduling phase: upon job request from host h
5.4 Replication phase: once in replication cycle
Deployment of GridBot for Superlink-online system
Naïve execution of BOT in multi-grid
High throughput run statistics
Scalability benchmark for the number of BOTs in the queue
Influence of policies on turnaround time for small BOTs
Influence of replication policy in a mixture of grids
Computing MPF
MPF access pattern for computing different functions
MPF kernel pseudocode
GPU kernel pseudocode
Dynamic unrolling pseudocode
Linear-domain performance on random data sets
Log-domain speedups on random data sets
Cache performance
Loop unrolling
User-managed vs. texture cache
Overhead analysis

Abstract

In this work we consider the challenges of accelerating scientific computations via parallel execution on large-scale non-dedicated distributed environments (also known as grids) and Graphical Processing Units (GPUs). Our primary motivation has been the acceleration of parametric genetic linkage analysis computations. The main practical outcome of this work is the design and implementation of a distributed system for genetic linkage analysis, called Superlink-online. It is a production online system which serves hundreds of geneticists worldwide, allowing for faster analysis of genetic data via automatic parallelization and execution on thousands of non-dedicated computers. Superlink-online literally realizes the original grand vision of grid computing: supercomputing power is made a commodity, relying solely on resources from a multitude of geographically distributed, large-scale, non-dedicated environments in different administrative domains. Yet this supercomputing power can be consumed by geneticists with the simplicity of plugging an electric appliance into a power grid.

The first part of this thesis is devoted to the design of a parallel algorithm for computing the probability of evidence in large Bayesian networks, which is the computational problem underlying genetic linkage analysis. Our algorithm was optimized for execution in non-dedicated large-scale grids. The core of the research focused on the following two issues in distributed and parallel computing, motivated by the needs of genetic analysis computations: (i) efficient execution of multiple Bags of Tasks (BOTs) of vastly different complexities over multiple grids; (ii) efficient execution of memory-intensive workloads with input-dependent access patterns on GPUs. The solutions to these two seemingly unrelated problems followed the same guiding principles: (i) identifying high-level abstractions that stress the system's core trade-offs; (ii) designing generic runtime policy-driven mechanisms for explicit trade-off optimization; (iii) optimizing the balance between the expressiveness and precision of the policy, and the runtime overhead incurred in its evaluation and enforcement.

The main contribution in non-dedicated large-scale computing is four-fold. First, we show that by combining the idea of multilevel feedback queue scheduling with host reliability matching, it is possible to achieve low relative slowdown when executing workloads comprising massively parallel and short BOTs on throughput-optimized grids. We then generalize this approach and demonstrate that a handful of basic generic policy-driven mechanisms, such as resource matching, task replication and host-specific ranking, make it possible to simulate a variety of scheduling algorithms optimizing different target functions, including the total resource cost, makespan, BOT slowdown, and others. We devise an approximate algorithm for runtime policy evaluation which allows for handling millions of enqueued jobs by clustering them according to the BOT to which they belong. This algorithm enables the implementation of the runtime policy-driven mechanisms at large scale. Finally, we experimentally demonstrate the scalability and efficiency of our system by running linkage analysis workloads over nine different grids and thousands of non-dedicated CPUs, achieving an effective throughput equivalent to 8,000 dedicated CPU cores.

An important result of our work on GPUs is the introduction of a formal approach for handling memory-intensive workloads with complex memory reuse patterns. We show that by applying a structured approach to programming the GPU scratchpad (close-to-ALU) memory via a software-managed cache, one can create an efficient computational algorithm that uses this cache, and also efficiently implement the policy-driven cache mechanism with low runtime overhead. We demonstrated that for workloads where the reuse pattern can be determined at runtime on a CPU, the cache policy can be represented as a lookup table, thus saving the costly execution of cache maintenance logic at runtime. This approach improved application performance by up to an order of magnitude versus the implementation without a cache. Furthermore, it enabled the computation of an analytical upper bound for the expected GPU performance of these applications, thus characterizing the workloads for which the use of GPUs would not be worthwhile.

Chapter 1

Introduction

Computation of the logarithm of odds (LOD) score is a valuable tool for mapping disease-susceptibility genes in the study of Mendelian and complex diseases. Successful identification of the affected genes helps provide better prevention and treatment for a disease, and reveals the functional characteristics of genes. Computation of the LOD score, defined as $\log_{10}(L_{H_A}/L_{H_0})$, where $L_{H_0}$ is the likelihood under the null hypothesis of no linkage between the markers and the disease locus, and $L_{H_A}$ is the likelihood under linkage, requires efficient methods, especially for large pedigrees and/or many markers. A linkage analysis is performed by placing a trait locus at various positions along a map of markers to locate regions that show evidence of linkage and merit further study. To extract full linkage information from pedigree data it is desirable to perform multipoint likelihood computations using all available relevant data jointly. There are two main approaches for computing multipoint likelihoods: Elston-Stewart [48] and Lander-Green [82]. The complexity of the Elston-Stewart algorithm is linear in the number of individuals, but exponential in the number of markers. On the other hand, the complexity of the Lander-Green algorithm increases linearly with the number of markers, but exponentially with the number of individuals in the pedigree. A recently proposed approach is to combine and generalize the previous two methods by using the framework of Bayesian networks as the internal representation of linkage analysis problems [55]. Using this representation enables efficient handling of a wide variety of likelihood computations by automatic choice of the computation order according to the problem at hand.

The computation of exact multipoint likelihoods of large inbred pedigrees with extensive missing data is often beyond the capabilities of a single computer. Two complementary approaches can facilitate more demanding linkage computations: designing more efficient algorithms, and parallelizing the computation to use multiple computers. Both approaches have been pursued over the years. Algorithmic improvements of exact likelihood computations have been reported in [13, 37, 55, 61, 80, 102, 103, 113]. For example, the efficient implementation of the Lander-Green algorithm by the Genehunter program allows multipoint analysis of medium-sized pedigrees with a large number of markers [79, 80]; Vitesse v.2 implements optimizations of the Elston-Stewart algorithm, extending its computational boundaries by orders of magnitude [102, 103]; Superlink applies enhanced optimization techniques for finding a better order of computations in Bayesian networks, making it possible to perform multipoint analysis of larger inbred families [53-55]. Parallel algorithms for linkage analysis have been reported in [36, 44, 46, 62, 77, 95, 97, 109]. Parallel computing was successfully applied to improve the performance of the Linkage and Fastlink packages, speeding up the computations by using a set of dedicated processors [46, 62, 77, 97, 109]. Efficient parallel implementations of Genehunter, which divide the computations over high-performance processors, achieve significant speedups versus the serial version and allow for analysis of larger pedigrees [36]. Despite the advantages of parallel computations, the use of parallel programs for linkage analysis is quite limited. Their execution requires high-performance resources, such as a cluster of dedicated machines or a supercomputer. Such hardware can usually be found only in specialized research centers due to its high cost and operational complexity.
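As a toy illustration of the LOD score defined above, the following sketch computes $\log_{10}(L_{H_A}/L_{H_0})$ from two likelihood values; the numerical values are made up purely for illustration.

```python
import math

def lod_score(l_ha: float, l_h0: float) -> float:
    """LOD score: log10 of the ratio between the likelihood under linkage (HA)
    and the likelihood under the null hypothesis of no linkage (H0)."""
    return math.log10(l_ha / l_h0)

# Hypothetical likelihood values, for illustration only.
print(lod_score(2.5e-10, 1.0e-13))  # ≈ 3.40, above the conventional 3.3 threshold
```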
Furthermore, to be of practical interest and enable complex analyses, a parallel version is required to boost performance by several orders of magnitude, since the computational complexity is exponential in certain parameters of the genetic data. In this work we describe the algorithms and mechanisms underlying a distributed system for parametric genetic linkage analysis, called Superlink-online. It is an online system which serves hundreds of geneticists worldwide, allowing for faster analysis of genetic data via automatic parallelization and execution on thousands of non-dedicated computers.

The system is capable of analyzing inbred families of several hundred individuals with extensive missing data, outperforming all existing tools for exact linkage computations on such inputs. It is based on the parallelization of Superlink, a state-of-the-art program for parametric linkage analysis of dichotomous/binary traits in large inbred pedigrees. Superlink-online delivers the newest information technology to geneticists via a simple Internet interface, which completely hides the complexity of the underlying distributed system. The system allows for concurrent submission of linkage tasks by multiple users, dynamically adapting the parallelization strategy according to the current load and the number of computers available to perform the computations. As the system was developed, it was extensively used by collaborating medical centers worldwide on a variety of real data sets. Since its publication in the American Journal of Human Genetics in 2006 [118], Superlink-online has served more than 300 geneticists in leading medical and genetic research institutions worldwide. The system successfully performed over 21,000 individual real-data analyses, some of which resulted in revealing genetic mutations and were published in the genetics literature [21, 30, 34, 57, 63, 89, 91, 92, , 120, 126, 129] 1.

In Chapter 3 we describe the design and evaluation of the parallel algorithm for computing the probability of evidence in large Bayesian networks, which is the computational problem underlying genetic linkage analysis. Our algorithm was specially designed for execution in non-dedicated large-scale computing environments (also known as grids), which are characterized by the presence of many computers with different capabilities and operating systems, frequent failures, and extreme fluctuations in the number of computers available for execution. The key algorithmic issue has been to enable embarrassingly parallel, communication-less execution with adjustable sub-task granularity to hide the grid execution overheads. We showed speedups of up to two orders of magnitude on previously infeasible inputs, achieved by executing the analysis in a non-dedicated large-scale resource pool.

1 This research was awarded the Honorary Mention Award at the Supercomputing 2009 conference in Portland, US, and covered in local and international media including Science Magazine.
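The idea of embarrassingly parallel execution with adjustable sub-task granularity can be sketched as follows: enumerate the joint assignments to a set of conditioning variables and partition them into independent chunks, each of which becomes one grid job. This is a minimal illustration, not the actual Superlink-online partitioning code; the function name and chunk size are made up.

```python
from itertools import product

def make_jobs(domains, chunk_size):
    """Partition the space of joint assignments to the conditioning variables
    (one domain size per variable) into independent chunks; each chunk can be
    shipped to a grid host as a self-contained, communication-less job."""
    assignments = list(product(*[range(d) for d in domains]))
    return [assignments[i:i + chunk_size]
            for i in range(0, len(assignments), chunk_size)]

# Three conditioning variables with 2, 3 and 2 values -> 12 joint assignments.
jobs = make_jobs([2, 3, 2], chunk_size=4)
print(len(jobs))  # 3 jobs of 4 assignments each
```

A larger chunk size yields fewer, longer jobs (amortizing grid overheads), while a smaller one increases parallelism; tuning this knob is precisely the granularity adjustment mentioned above.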

Chapter 4 presents the scheduling and allocation algorithm for executing multiple embarrassingly parallel Bags of Tasks (BOTs) with vastly different computational demands on a set of uncoordinated grids. This algorithm, motivated by the characteristics of the real workload observed in the Superlink-online system, strives to minimize the turnaround time of shorter tasks while at the same time allowing high throughput for the massively parallel ones. We proposed the concept of a grid execution hierarchy, where the available grids are sorted according to their size, and the execution overheads increase with the size of the grids. The algorithm finds a grid whose size, availability, and overhead best match a task's resource requirements and expected turnaround time. Our approach was inspired by the Shortest Processing Time First (SPTF) policy, in the sense that the task's processing demands are constantly reevaluated during its run, so that a task is migrated to a more suitable level of the execution hierarchy when appropriate. The evaluation of this approach in the context of the Superlink-online system showed nearly interactive response times for shorter tasks, while simultaneously serving throughput-oriented massively parallel tasks in an efficient manner. The growing computational demands of the Superlink-online system, together with the emergence of additional accessible grid environments and the desire to exploit spare CPU cycles of private home computers, called for an extension of the existing execution infrastructure. The new system outlined in Chapter 5, called GridBot, implements a holistic approach to the efficient execution of Bags of Tasks on multiple grids, clusters, and volunteer computing grids.
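The grid-execution-hierarchy idea described above can be sketched as a simple matching rule: levels are sorted by size (and hence overhead), and a task is placed at the smallest level whose budget covers its currently estimated demand, migrating upward when a reevaluation exceeds that budget. The level names and thresholds below are illustrative, not the actual Superlink-online configuration.

```python
# Levels sorted by size; larger levels tolerate larger tasks but incur
# higher execution overheads. Values are hypothetical.
LEVELS = [
    {"name": "dedicated-cluster", "max_cpu_hours": 1.0},
    {"name": "campus-grid",       "max_cpu_hours": 100.0},
    {"name": "multi-grid",        "max_cpu_hours": float("inf")},
]

def assign_level(estimated_cpu_hours: float) -> str:
    """Return the smallest hierarchy level adequate for the task; if a running
    task's reevaluated estimate grows, calling this again yields the level it
    should migrate to (the SPTF-inspired reevaluation step)."""
    for level in LEVELS:
        if estimated_cpu_hours <= level["max_cpu_hours"]:
            return level["name"]
    return LEVELS[-1]["name"]

print(assign_level(0.5))   # short task stays on the fast, low-overhead level
print(assign_level(5000))  # massively parallel task goes to the largest level
```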
GridBot's approach generalizes and significantly enhances the former system, virtualizing the multiple environments under a single unified framework and enabling arbitrary scheduling algorithms to be implemented as plug-ins to a generic policy-based scheduling mechanism. This mechanism made possible the implementation of arbitrary dynamic scheduling and replication policies which can depend on the system state, task execution state, and task priority. We demonstrated GridBot's capabilities in a production setup as part of the Superlink-online system. GridBot executed hundreds of BOTs with over 9 million jobs during three months alone; these were invoked on 25,000 hosts, 15,000 from the Superlink@Technion community grid and the rest from the Technion campus grid, local clusters, the Open Science Grid, EGEE, and the UW Madison pool. Our results show that the different scheduling policies, combined with GridBot's efficient execution mechanisms, result in up to an order of magnitude reduction in turnaround time when running a single Bag of Tasks versus the previous version of the system.

In Chapter 6 we explore a complementary approach to accelerating the linkage analysis computations by mapping them onto Graphical Processing Units (GPUs). As part of this effort we proposed a technique for designing memory-bound algorithms with high data reuse on GPUs equipped with close-to-ALU software-managed memory. The approach is based on the efficient use of this memory by implementing an application-specific cache in software, and enables an analytical performance model of such algorithms. We applied this technique to the design and implementation of a GPU-based solver of the sum-product, or marginalize a product of functions (MPF), problem, which arises in a wide variety of real-life applications in artificial intelligence, statistics, image processing, and digital communications. In particular, the computation of the probability of evidence in Bayesian networks is an instance of MPF. Computing MPF is similar to computing the chain product of multi-dimensional matrices, but is more difficult due to a complex data-dependent access pattern, high data reuse, and a low compute-to-memory access ratio. Our GPU-based MPF solver achieves orders-of-magnitude speedups (up to 2700-fold on random data and 270-fold on real-life genetic analysis datasets) on NVIDIA's GeForce 8800GTX GPU over an optimized CPU version on an Intel 2.4 GHz Core 2 with a 4 MB L2 cache, and scales further almost two-fold on the new-generation GTX285 NVIDIA cards.

Chapter 2

Preliminaries

2.1 Exact parametric linkage analysis

Linkage analysis tests for co-segregation between alleles at markers in a chromosomal region and at a trait locus of interest. In parametric linkage analysis the likelihood for evidence of linkage $L(\theta)$ is computed under a given model of disease allele frequency, penetrances, recombination fractions between markers, and recombination fraction $\theta$ between a disease locus and a reference locus. Computations of $L(\theta)$ in this work assume Hardy-Weinberg equilibrium, linkage equilibrium, and no interference, which are common assumptions in most linkage analyses done to date. It is common to consider $\mathrm{LOD}(\theta_1) = \log_{10} L(\theta = \theta_1)/L(\theta = 0.5)$ above 3.3 as an indication of linkage [83].

Consider a pedigree and let $x_i$ denote the observations, which include the affection status and marker information at one or multiple loci of the $i$th pedigree member. The likelihood is the probability of the observations, defined by $L(\theta) = P(x_1, x_2, \ldots, x_m \mid \theta)$. Elston and Stewart have shown that for simple pedigrees without loops the likelihood may be represented as the telescoping sum

$$L(\theta) = \sum_{g_1} P(x_1 \mid g_1) P(g_1) \cdots \sum_{g_{m-1}} P(x_{m-1} \mid g_{m-1}) P(g_{m-1}) \sum_{g_m} P(x_m \mid g_m) P(g_m),$$

where the individuals are ordered such that parents precede their children, and $P(g_i)$ represents either the probability of the $i$th child's multilocus genotype given the parental multilocus genotypes, or the probability that a founder individual (with no parents in the pedigree) has multilocus genotype $g_i$ [48]. The Elston-Stewart algorithm and its extensions have been employed in many linkage analysis programs. While the early implementations (liped [104], linkage [85, 86]) allow for computation of two-point and multipoint LOD scores of small pedigrees using few markers, their successors, such as later versions of linkage and fastlink [37], extend their capabilities considerably. The current serial version of fastlink improves the analysis of complex pedigrees by efficient loop breaking algorithms [25]. Further versions of fastlink utilize parallel computing to achieve higher performance [46, 62, 77, 109]. Another example of efficient optimizations is vitesse v.1, which raises the computational boundaries of the Elston-Stewart algorithm and allows for the computation of multipoint LOD scores for several polymorphic markers with many unknown genotypes via set recoding [103]. The complexity of the Elston-Stewart algorithm grows linearly with the size of the pedigree, but exponentially with the number of markers in the analysis, as the algorithm essentially carries out a summation over all possible genotype vectors. Also, in the extension to the analysis of complex pedigrees, the running time of the loop breaking method grows exponentially in the number of different loop breakers to be considered.

Exact multipoint linkage computations involving many markers for small to medium sized families were first made practical with the introduction of the genehunter software [79], which uses the Lander-Green algorithm [82].
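As a toy illustration of the telescoping sum above, the following sketch "peels" a nuclear family (father, mother, one child) by summing the child's genotype innermost; the parents are founders, and the child's genotype depends on both parents. All probability tables and the two-value genotype domain are made up for illustration.

```python
# Toy Elston-Stewart-style peeling for a nuclear family.
G = [0, 1]  # toy genotype domain

p_founder = {0: 0.7, 1: 0.3}   # P(g) for founders
p_obs     = {0: 0.9, 1: 0.2}   # P(x_i | g_i) for one fixed observation x_i
# P(g_c | g_f, g_m): hypothetical uniform transmission table
p_child   = {(gf, gm): {0: 0.5, 1: 0.5} for gf in G for gm in G}

def likelihood():
    """Sum over genotypes with parents outermost and the child innermost,
    mirroring the telescoping sum (parents precede their children)."""
    total = 0.0
    for gf in G:                # outer sum: father's genotype
        for gm in G:            # then the mother's
            # innermost sum over the child's genotype, reused for this (gf, gm)
            inner = sum(p_obs[gc] * p_child[(gf, gm)][gc] for gc in G)
            total += p_obs[gf] * p_founder[gf] * p_obs[gm] * p_founder[gm] * inner
    return total

print(likelihood())
```

Pushing the child's sum inside is what keeps the work linear in the number of individuals for loop-free pedigrees, instead of enumerating all joint genotype vectors.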
While in the Elston-Stewart algorithm the unobserved quantity is the genotype, the Lander-Green algorithm proposes to condition the observations on inheritance vectors for each locus, where an inheritance vector is a binary vector indicating the parental source of each allele for each non-founder in the pedigree. The Lander-Green algorithm can potentially compute multipoint likelihoods on a practically unbounded number of markers. However, the computational capabilities of the algorithm are restricted to the analysis of pedigrees of moderate size, as its computing time and memory requirements scale exponentially with the number of non-founders, which translates into the number of inheritance vectors to be considered. Later versions of Genehunter optimized the performance by applying the Fast Fourier Transform (FFT) for

matrix multiplication [80] and by considering only the set of inheritance vectors compatible with the observed genotypes [94]. Allegro [61] and Merlin [13] present alternative implementations of the Lander-Green algorithm, showing speedups of up to two orders of magnitude versus genehunter, achieved through further reduction of the inheritance vector space and the application of software optimization techniques. Advances in parallel computing allowed the parallelization of genehunter for execution on a cluster of high-performance workstations, enabling faster computations [36]. Also, genehunter-twolocus [121] has been parallelized [44]. Another approach is presented by superlink, where the pedigree data is represented as a Bayesian network, making it possible to significantly improve the performance by optimizing the order in which variables are eliminated [54, 55]. As opposed to the Elston-Stewart and the Lander-Green algorithms, where variables are eliminated in a predetermined order, superlink finds the order of variable elimination at run time according to the problem at hand, enabling multipoint likelihood computations on large inbred families, which cannot be carried out by other current linkage analysis software.

2.2 Bayesian networks

Bayesian networks [87, 107], also known as directed graphical models, are a knowledge representation formalism that offers a powerful framework for modeling complex multivariate problems, such as the ones posed by genetic analysis. A Bayesian network is defined via a directed acyclic graph (DAG), that is, a directed graph with no directed cycles. Each node $v$ has a set of parents $Pa_v$, namely the set of vertices from which there are edges leading into $v$ in the DAG. A Bayesian network is a DAG where each vertex $v$ corresponds to a discrete variable $X_v \in X$ with a conditional probability distribution $P(X_v = x_v \mid Pa_v = pa_v)$, and the joint probability distribution $P(X)$ is the product of the conditional probability distributions of all variables. In other words,

$$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_v P(X_v = x_v \mid Pa_v = pa_v), \quad (2.1)$$

[Figure 2.1: A Bayesian network for three-point analysis]

where $pa_v$ is the joint assignment $\{X_i = x_i \mid X_i \in Pa_v\}$ to the variables in $Pa_v$. When $Pa_v = \emptyset$, the respective term in Eq. 2.1 reduces to $P(X_v = x_v)$. Note that each missing edge represents a conditional independence assertion. This is the key factor allowing Bayesian networks to efficiently handle joint probability distributions over a large number of variables.

An example is shown in Figure 2.1, which depicts a Bayesian network for three-point analysis (two markers and one disease locus) of a nuclear family with two typed children. The genetic loci variables of individual $i$ at locus $j$ are denoted by $G^m_{i,j}$ and $G^p_{i,j}$ for the maternal and paternal alleles, respectively. Variables $E_{i,j}$, $S^m_{i,j}$ and $S^p_{i,j}$ denote the marker phenotypes (unordered pairs of alleles), and the maternal and paternal selector variables of individual $i$ at locus $j$, respectively. Variable $E_i$ denotes the affection status of individual $i$. Individual 4 is affected, as indicated by node $E_4$. Highlighted nodes represent evidence variables, including the available marker phenotypes and affection status. The marker phenotypes of the parents are unknown in this example. The quantities $P(G^m_{i,j})$ and $P(G^p_{i,j})$ represent allele frequencies, $P(S^m_{i,j} \mid S^m_{i,j-1})$ and $P(S^p_{i,j} \mid S^p_{i,j-1})$ represent recombination probabilities, and $P(E_i \mid G^m_{i,j}, G^p_{i,j})$ represent penetrances. The joint distribution is the product of all the probability tables. The assumptions of no interference, and of Hardy-Weinberg and linkage equilibria, are encoded by the edges missing from the Bayesian network. More details are in [56].

Bayesian networks are used in this work for computing the probability of evidence, which is represented as a joint assignment $e = \{X_{e_1} = e_1, X_{e_2} = e_2, \ldots, X_{e_m} = e_m\}$ to a subset of variables $E = \{X_{e_1}, X_{e_2}, \ldots, X_{e_m}\}$. It is computable from the joint probability distribution table by summing over all variables $X_1, \ldots, X_k$ not in $E$, namely

$$P(e) = \sum_{x_1} \cdots \sum_{x_k} \prod_v P'(X_v = x_v \mid Pa_v = pa_v), \quad (2.2)$$

where $P'$ is obtained from $P$ by assigning the observed values $\{e_1, e_2, \ldots, e_m\}$ to the respective variables in $E$. In Figure 2.1 evidence variables such as marker phenotypes and affection status are highlighted. The straightforward approach of computing $P(e)$ by first multiplying all conditional probability tables and then computing all the sums is infeasible due to the exponential size of the joint probability distribution. Instead, it is possible to interleave summations and multiplications, summing variables one after another at early stages of the computation by pushing the summation signs in Eq. 2.2 to the right as far as possible. When summing over all values of a variable $X$, it suffices to compute the product of only those probability tables which contain $X$, yielding an intermediate table over the variables of the tables being multiplied. The summation over variable $X$ eliminates it from the product, reducing the dimensions of the intermediate table by a factor equal to the number of values of $X$. This technique for computing Eq. 2.2 is called variable elimination in the Bayesian network literature [40]. Variable elimination alone is often inapplicable due to the prohibitively large size of the intermediate tables generated during the computation, which may exceed the physical memory of contemporary computers. An alternative approach to computing Eq. 2.2 is to simplify a given problem by first assigning values to some subset of variables $C \subseteq X$, and then performing the computation for every joint assignment $c$ to the variables in $C$. Assigning
Variable elimination alone is often inapplicable due to the prohibitively large size of intermediate tables generated during the computations, exceeding the physical memory of contemporary computers An alternative approach to computing eq.2.2 is to simplify a given problem by first assigning values to some subset of variables C X, and then performing the computation for every joint assignment c to the variables in C. Assigning 13

a value to variables in $C$ decreases the size of the corresponding probability tables, and consequently the size of intermediate tables, reducing the original problem and fitting it for computation via variable elimination. Eq. 2.2 can then be rewritten as $$P(e) = \sum_c \varepsilon_c, \qquad (2.3)$$ where $\varepsilon_c = \sum_{y \in \{X \setminus C\}} \prod_v P^*(Y_v = y_v \mid Pa_v = pa_v)$ represents the computation of the problem for a specific joint assignment $c$ to the variables in $C$. The variables in $C$ are called the conditioning variables and this method is called conditioning [107]. It is used in [84] to extend the Elston-Stewart algorithm to looped pedigrees. Conditioning as described above is inefficient due to repetitive evaluation of identical subexpressions when computing $\varepsilon_c$ for different joint assignments $c$ to the conditioning variables. A more efficient algorithm, called constrained variable elimination, significantly reduces the amount of redundant computation by interleaving conditioning and elimination. This algorithm is the basis of the genetic linkage software superlink [54]. The constrained elimination algorithm applies variable elimination until no more variables can be eliminated without exceeding the specified memory constraints. Conditioning is then applied to the smallest subset of variables which suffices to reduce the size of intermediate tables to meet the specified memory constraints. The steps of elimination and conditioning are interleaved until all products and sums have been computed. Both the time and space complexity of the constrained variable elimination algorithm depend on the order in which variables are conditioned or eliminated. The running time of the algorithm applied to the same problem can range over many orders of magnitude depending on the chosen order of computations. Finding an optimal combined conditioning and elimination order, which minimizes the computation time for general graphs, is computationally hard [53].
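The following toy sketch illustrates conditioning (again, not Superlink's code): the summation over a conditioning variable splits the problem into independent subproblems $\varepsilon_c$, one per assignment $c$, each of which could be evaluated on a different machine.

```python
from itertools import product

# Toy chain network over three binary variables: P(A) P(B|A) P(C|B).
# Conditioning on C splits the overall summation into one independent
# subproblem eps(c) per value c of C, as in eq. 2.3.
domains = {'A': [0, 1], 'B': [0, 1], 'C': [0, 1]}

def p_joint(a, b, c):
    """Product of the (arbitrary, normalized) toy probability tables."""
    pa = [0.7, 0.3][a]
    pb = [[0.6, 0.4], [0.1, 0.9]][a][b]
    pc = [[0.5, 0.5], [0.2, 0.8]][b][c]
    return pa * pb * pc

def eps(c):
    """Subproblem for one assignment to the conditioning variable C:
    sum over the remaining variables with C fixed. Each call is
    independent of the others, so the calls can run in parallel."""
    return sum(p_joint(a, b, c) for a, b in product([0, 1], [0, 1]))

# P(e) = sum_c eps_c; with no evidence this is the total probability.
total = sum(eps(c) for c in domains['C'])
print(total)  # ~1.0
```

Within each subproblem, variable elimination proceeds on the reduced tables; this independence of the $\varepsilon_c$ terms is what the parallel algorithm of Chapter 3 exploits.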
In fact, the Elston-Stewart and the Lander-Green algorithms can be viewed as instances of a variable elimination method that use predetermined elimination orders: the Elston-Stewart algorithm eliminates one nuclear family after another, whereas the Lander-Green algorithm eliminates one locus after another [53]. Superlink implements a stochastic greedy algorithm for determining the combined conditioning and elimination order, which strives to minimize the execution time under given memory constraints [53]. The next variable for elimination or conditioning is selected among the set of variables ranked highest by some choice criterion. The specific elimination variable is selected at random from this set. A single iteration of the algorithm computes an elimination and conditioning order and its elimination cost, defined as the sum of the sizes of all intermediate tables generated during the computation. This procedure is invoked multiple times using various choice criteria, producing a set of orders. Finally, the order with the lowest elimination cost is chosen. The order found is optimized for the linkage problem at hand, automatically handling pedigrees of any topology and size as well as missing data and multiple markers. This algorithm is extended to allow efficient parallelization, as will be described in Chapter 3.

2.3 Grid computing

The term grid refers to a distributed computing environment with opportunistic, best-effort, preemptive resource allocation policies. Namely, jobs can be preempted by the resource manager at any moment, and neither the amount of available resources nor the time it takes to acquire them is bounded. In contrast, a dedicated cluster (or cluster, for short) is a computing environment with a preemption-free allocation policy and short queuing times. We further categorize grids into collaborative and community grids. Collaborative grids are formed by a federation of resources from different organizations, shared among the participants according to some agreed policy, e.g., fair-share. Community grids consist of the home computers of enthusiasts who donate them to one or several scientific projects. Collaborative grids are built as a federation of clusters (not necessarily dedicated, in our terms). Each cluster is managed by a local batch queuing system, which, along with the resource allocation policy for its local users, also obeys the global grid-wide user policies.
In the following we focus on large-scale collaborative grids such as EGEE [6] and OSG [9]. The internal structure and the resources of each cluster in a grid are hidden behind the gateway node, which is used for job submission and monitoring. The compute nodes often reside on a private network or behind a firewall, and local login to the nodes or the gateway is not permitted. The grid users submit jobs directly to the gateway or via Resource Brokers

(as in EGEE). Community grids rely on home computers around the world. They have been popularized by the Berkeley Open Infrastructure for Network Computing (BOINC) [18], used for establishing community computing grids for specific scientific needs. Such a grid is managed by a single work-dispatch server, which distributes the jobs among the BOINC clients. The BOINC client, installed on a volunteer's computer, may connect to multiple work-dispatch servers, effectively sharing the machine between multiple community grids. This crucial feature makes the idea of establishing community computing grids particularly appealing. Indeed, in theory, over three million participating computers can be accessed. The only challenge, which turns out to be surprisingly difficult, is to motivate their owners to join the newly established grid.

2.3.1 Grid characteristics

In this section we analyze the properties of a number of collaborative and community grids in order to quantify the parameters affecting BOT performance. These will determine the realistic assumptions we can make while designing our solution. In Figure 2.2(a) we present the history of the number of available resources as observed by a user with a steady demand of 1000 jobs, measured during one week in OSG and the UW Madison Condor pool. This short snapshot demonstrates typical behavior observed in opportunistic grids, and highlights the difficulty of providing short-term predictions of resource availability. Observe the sharp changes (sometimes of an order of magnitude) in the number of allocated resources over short time periods. The growth in the number of resources during the 21st-22nd of February is, however, expected, as it coincides with the weekend. This variability prompts a design that does not rely on static estimates of the number of available resources.
Figure 2.2(b) shows the distribution of queuing times of a random sample of about 45,000 jobs invoked in the UW Madison pool and 12,000 jobs in the OSG, measured from the moment the job enters the batch queue until it starts execution. The measurements were performed in the steady state with 100 running jobs. Termination of one job triggered

submission of a new one. Observe the variations in queuing times, which range from a few seconds to a few hours. Similar findings were reported in [88] for EGEE. These results unequivocally show that obtaining short turnaround times requires special mechanisms to overcome the long delays. Figure 2.2(c) summarizes the failure rate of jobs 20 to 60 minutes long, measured during one month of operation in OSG, EGEE, UW Madison, the Technion cluster (100 cores) and the community grid Superlink@Technion with ~15,000 CPUs. Note that jobs executed in collaborative grids experience a quite high failure rate due to preemptions, whereas failures due to hardware or software misconfiguration are rare. The community grids, however, have a low preemption rate and frequent hardware and software failures. Thus, any solution for BOT execution has to be optimized to handle job failures.

2.4 GPU programming and CUDA

The modern GPU is a highly data-parallel processor. A GPU features many lightweight, closely-coupled thread processors that run in parallel. While the performance of each thread processor is modest, by effectively using many thread processors in parallel, GPUs can deliver performance that substantially outpaces a CPU. The key to achieving high performance on GPUs is to express the computation in a data-parallel manner. The programming model of the GPU is single-program, multiple-data (SPMD): many threads execute the same program in parallel, processing different data. The GPU is most effective when thousands of threads are available to the hardware at any time; the GPU is capable of quickly switching between these threads to hide latency and keep the hardware busy. The recent introduction of programming environments for the development of non-graphics applications on GPUs facilitated the use of GPUs for high-performance computations. One such environment, which we use in our work, is NVIDIA's CUDA.
In a broad view, CUDA's design goals are to provide a high-level programming environment and to enable direct access to the GPU's computing units, as explained below.

2.4.1 High-level programming environment

CUDA programs are based on the C programming language, with extensions to exploit the parallelism of the GPU. CUDA programs are explicitly divided into code that runs on the CPU and code that runs on the GPU. GPU code is encapsulated into a kernel, which exemplifies the SPMD model: it looks like a scalar C program, but is invoked concurrently in thousands of threads by the hardware. Kernel code allows arbitrary read-write access to global memory, but synchronization is the responsibility of the user. CUDA also exposes low-latency (~1 cycle) memory shared among a subset of threads, called a thread block (up to 512 threads per block). The threads of each block have exclusive access to a small chunk (16 KB) of this memory, and no access to the chunks of other thread blocks. No communication among the threads of different thread blocks is permitted.

2.4.2 Direct Compute Access

NVIDIA GPUs feature multiple multiprocessors (16 multiprocessors in the GeForce 8800 GTX), each with 8 thread processors. The GPU is responsible for mapping blocks (specified by the programmer) to these multiprocessors. If enough resources are available, each multiprocessor typically has multiple blocks resident, and can quickly switch between computation on different blocks when appropriate. For instance, if one block starts a long-latency memory operation, the multiprocessor will kick off the memory request and then immediately switch to another block while those memory requests are satisfied. Typical CUDA programs will first set up input data on the CPU and transfer it to the GPU, then run the kernel on the GPU data, and finally transfer the result back to the CPU.

Figure 2.2 Performance characteristics of real grids: (a) Resource availability (number of running jobs in UW Condor and OSG, February 19-26). (b) Queuing time distribution (percentage of invoked jobs vs. queuing time in seconds, UW Condor and OSG). (c) Failure rate in different grids (number of jobs, preempted %, failed % for UW Madison, OSG, EGEE and Technion). A job is counted as preempted if no results were returned before the deadline.

Chapter 3

Parallelization of linkage analysis for grids¹

In this chapter we describe the algorithms for distributed large-scale execution of exact linkage analysis on non-dedicated grid resources. The algorithms were implemented and deployed as part of the superlink-online distributed system, which delivers the technology to geneticists via a simple Internet interface, completely hiding the complexity of the underlying distributed system. The system allows for concurrent submission of linkage tasks by multiple users, dynamically adapting the parallelization strategy according to the current load and the number of computers available to perform the computations. Our results on real data show improvements in running time of more than two orders of magnitude versus the serial version of the program. This allows users to avoid the undesired breakup of larger pedigrees into smaller pieces, which often weakens the linkage signal.

3.1 Parallelization of multipoint linkage analysis

Multipoint LOD score computations are performed in three phases:

Phase I. Each pedigree in the input is transformed to a Bayesian network representation

¹ Based on the paper [118]

[56], such as the one given by Figure 2.1. A new Bayesian network is constructed for every position of the disease locus.

Phase II. The Find-Order algorithm (Figure 3.1) is applied for each Bayesian network, yielding an elimination order which strives to optimize the computations under given memory constraints.

Phase III. Likelihood computations are performed via eq. 2.2 for each Bayesian network, by eliminating the variables according to the specified order, yielding the likelihood of the data for a specific disease locus position (see Figure 3.3).

Input: A Bayesian network N, threshold T, number of iterations L, a set of l choice criteria C_1, ..., C_l for choosing the next variable to eliminate. {For example, criterion C_1 is to choose a variable that produces the smallest intermediate table. Other criteria are described in [53].}
Output: An elimination order Π such that the elimination cost of each variable ≤ T. {Elimination cost is the size of the intermediate table created by eliminating a variable.}
1. j ← 1, Cost ← ∞, Found ← false
2. For i ← 1 to L do
   (a) Run L_min iterations using choice criterion C_j to compute a candidate elimination order:
       For k ← i to i + L_min do
           Π_temp ← SG(N, T, C_j) {this procedure is given in Figure 3.2}
           Compute the sum of all tables created when using the order Π_temp: Cost_temp ← Cost(Π_temp)
           Update the best order: If Cost_temp < Cost then Π ← Π_temp; Cost ← Cost_temp; Found ← true
   (b) Switch to the next choice criterion if no order found: If Found is false then j ← (j + 1) mod l {This way of skipping between criteria enhances the Find-Order procedure of [53]}
   (c) Found ← false; i ← i + L_min
3. Return Π

Figure 3.1 Procedure Find-Order(N, T, L) from [53]

Several levels of parallelism are readily available. First, each Bayesian network created in Phase I is processed concurrently. This alone is insufficient for enabling LOD

Input: A Bayesian network N with the set of variables V, a threshold T, an elimination choice criterion E
Output: An elimination order Π such that the elimination cost of each variable ≤ T. {Elimination cost is the size of the intermediate table created by eliminating a variable.}
1. While V ≠ ∅
   (a) Pick 3 variables {V_e1, V_e2, V_e3} from V, according to the elimination choice criterion E
   (b) Choose at random V_e ∈ {V_e1, V_e2, V_e3}
   (c) Compute the elimination cost E(V_e)
   (d) Decide whether to perform conditioning or eliminate V_e:
       If E(V_e) > T then {conditioning}
           Pick V_c ∈ V according to the conditioning choice criterion C
           Add V_c to Π as a conditioning variable
           Remove V_c from V
           Update the Bayesian network N after conditioning by reducing all tables containing V_c by the factor equal to the number of values of V_c
       Else {elimination}
           Add V_e to Π as an elimination variable
           Remove V_e from V
           Update the Bayesian network N after elimination by producing a single table with variables from all tables containing V_e, and removing these tables from the Bayesian network
2. Return Π

Figure 3.2 Procedure SG(N, T, E) from [53]

score computations of large pedigrees for a given disease locus position. Thus parallelization is performed even for computing the LOD score for one locus. This change yields a significant algorithmic improvement versus serial computations.

3.1.1 Parallelization of Phase II

The algorithm Find-Order (see Figure 3.1) yields better elimination orders as more iterations are executed. The execution time of computing an optimized order of likelihood computations should constitute a small fraction of the total running time and has been restricted to 5% in the serial implementation [54]. Clearly, any speedup of such a small part of the computations would have only minimal direct impact on the overall performance. However, since the complexity of the order found is crucial for

Input: A Bayesian network N, a combined conditioning and elimination order Π
Output: Probability of evidence P
1. While Π ≠ ∅
   (a) Pick the next variable V from Π
   (b) Π ← Π \ V
   (c) If V is an elimination variable {elimination}
       Multiply all tables containing V
       Sum over V and obtain the result P_t
       If P_t is a number {no further summation needed}
           P ← P · P_t
   (d) Else {conditioning}
       foreach value v of V:
           Create N_v by assigning V = v in all tables containing V
           P ← P + Constrained-Elimination(N_v, Π)
2. Return P

Figure 3.3 Procedure Constrained-Elimination(N, Π) from [53]

the entire likelihood computations, parallelization can be used to considerably increase the number of iterations of Find-Order, yielding orders of significantly lower elimination costs. The algorithm Parallel-Find-Order, presented in Figure 3.4, provides such a significant improvement, far beyond the speedup of the optimization phase alone. The input to the algorithm is a Bayesian network N and a threshold T. The threshold represents the amount of memory available for likelihood computations on a single computer and currently ranges between 10^8 bytes to bytes. Failure to fit the available memory leads to performance degradation and is avoided. In Grid environments with many computers of various capabilities, where it is impossible to predict which computer will be allocated for execution, the threshold T is determined dynamically according to the computer with the minimum amount of memory among the computers available for execution, but above some predefined value. Three steps comprise the optimization algorithm, where each step refines the optimization results of the previous step by applying more iterations of the procedure Find-Order

Input: A Bayesian network N, a threshold of T bytes
Output: An elimination order Π and its cost cost(Π)
{Step I: Quick complexity evaluation on one computer; takes seconds}
   Run Find-Order(N, T, L) for L = 5 iterations and report the best elimination order Π_1
   Set T_1 to be the average run time of one iteration
   If cost(Π_1) < C_1, then output the order Π_1 and the corresponding elimination cost cost(Π_1) and exit
   If cost(Π_1) > C_1, then output "task too complex" and exit
{Step II: Refined complexity evaluation on one computer; takes minutes}
   Set I_1 to be the number of iterations for the elimination cost cost(Π_1) via a conversion table for serial execution
   Run Find-Order(N, T, L) for L = I_1 iterations and report the best elimination order Π_2
   If cost(Π_2) < C_2, then output the order Π_2 and the corresponding elimination cost cost(Π_2) and exit
   If cost(Π_2) > C_2, then output "task too complex" and exit
{Step III: Final complexity evaluation on several computers in parallel; takes minutes to hours}
   Set k to be the value corresponding to cost(Π_2) via a conversion table for parallel execution
   Run k Find-Order(N, T, L) tasks in parallel, each for L = I_1 iterations, using at most k computers, and report the best evaluation order Π_3 found
   If cost(Π_3) > C_3, then output "task too complex" and exit
   Output the order Π_3 and the corresponding evaluation cost cost(Π_3) and exit

Figure 3.4 Algorithm Parallel-Find-Order

(see Figure 3.1). The first step runs five iterations of the Find-Order procedure, often yielding a far-from-optimal elimination order, especially for high-complexity problems. However, if the complexity of the order is below the threshold C_1, the total running time of the likelihood computation is small and further optimization is not needed. For problems with higher complexity the second step is executed.
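The staged escalation in Figure 3.4 can be sketched as follows. This is a simplified illustration only: find_order is a stub whose result improves with more iterations, the thresholds C1, C2, C3 and the conversion tables are stand-in values, Step III runs sequentially rather than on several computers, and the rejection logic is reduced to a single check per step.

```python
import random

# Stand-in thresholds; not Superlink's actual values.
C1, C2, C3 = 1e6, 1e12, 1e13

def find_order(n_iters, rng):
    """Stub for Find-Order: the best (lowest) elimination cost found
    over n_iters random restarts."""
    return min(rng.uniform(1e5, 1e9) for _ in range(n_iters))

def iterations_from_cost(cost):
    """Stand-in for the serial conversion table: spend roughly 5% of
    the expected likelihood-computation time on ordering."""
    return max(5, min(5000, int(cost / 1e5)))

def parallel_find_order(rng):
    # Step I: 5 quick iterations; accept cheap orders, reject hopeless ones.
    cost1 = find_order(5, rng)
    if cost1 < C1:
        return cost1
    if cost1 > C3:
        raise RuntimeError("task too complex")
    # Step II: refined serial search, iteration count from the conversion table.
    i1 = iterations_from_cost(cost1)
    cost2 = min(cost1, find_order(i1, rng))
    if cost2 < C2:
        return cost2
    # Step III: k parallel searches of i1 iterations each (sequential here).
    k = 10  # stand-in for the parallel conversion table
    cost3 = min(find_order(i1, rng) for _ in range(k))
    if cost3 > C3:
        raise RuntimeError("task too complex")
    return cost3

best = parallel_find_order(random.Random(42))
print(best < C2)  # True
```

The point of the structure is that cheap problems exit after seconds, while only problems whose first-pass cost estimate is high pay for the longer serial and parallel search phases.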
The number of iterations is now determined using a conversion table, which associates the complexity cost of the elimination order found in the first step to the total expected running time of the likelihood computations. The number of iterations to be executed in this step is determined so as to allocate 5% of the total expected computing time for the execution of the optimization algorithm, assuming that the average time for a single iteration is the same as during the execution of the first step. For complex problems several thousands of iterations of the Find-Order procedure are executed in this step. When completed, the elimination cost of the order found in this step is reevaluated to check whether additional optimization is required. Indeed, if the elimination cost is below the threshold C_2, further

optimization is unlikely to yield any significant improvement in the running time, and the optimization algorithm stops. On the other hand, if the complexity found is above the threshold C_2, then the task is rejected. The final optimization step is executed in parallel on several computers, and the total number of iterations depends both on the problem complexity and on the number of computers available at the moment of execution. Each computer is assigned to perform the same number of iterations as was performed in the second step, but the total number of computers can reach one hundred for problems which are on the edge of the system's capabilities. If, however, the number of available computers drops during the execution, as is common in Grid environments, the number of iterations to perform is dynamically decreased so that the execution time of the parallel ordering phase does not dominate the revised total execution time. The effect of the parallel ordering for large pedigrees is demonstrated by our results, where a speedup of two orders of magnitude is obtained solely due to this phase. Note that a similar increase of the number of iterations in the serial version of the algorithm leads to performance degradation, as the speedup due to a better order is outweighed by the long running time of the ordering stage.

3.1.2 Parallelization of Phase III

The choice of a parallelization strategy is guided by two main requirements. First, subtasks are not allowed to communicate or synchronize their state during the execution. This factor is crucial for the performance of parallel applications in a Grid environment, where communication between computers is usually slow, or even impossible due to security constraints. Second, parallel applications must tolerate frequent failures of computers during the execution by minimizing the impact of failures on the overall performance.
These requirements have not been considered in previous parallelization approaches for Bayesian network software [78, 90]. Our approach is to divide a large complex task into several smaller independent subproblems which can be processed concurrently. As noted previously, the serial algorithm computes the result of eq. 2.2 by eliminating variables one by one, and using conditioning

Input: A Bayesian network N, an elimination order Π, a threshold C
Output: Likelihood of data
1. P ← ∅
2. While the elimination cost cost(Π) > C
   (a) Choose a conditioning variable X from Π and add it to the set P
   (b) Remove X from the order Π
   (c) Adjust the Bayesian network N by setting X = x in all probability tables in which X appears
3. Obtain the total number of parallel subtasks L by multiplying the number of values of all variables in P
4. Create L Bayesian networks N_i and L elimination orders Π_i, one for every joint assignment of the variables in P
5. Run L Constrained-Elimination(N_i, Π_i) tasks in parallel using superlink
6. Output the likelihood by summing all partial results

Figure 3.5 Algorithm Parallel-Constrained-Elimination

when the intermediate results do not fit the memory constraints. However, we can apply conditioning before eliminating any variable, as shown in eq. 2.3. We calculate ε_c for every joint assignment c to the variables in C concurrently on several computers, and then obtain the final result by summation as in eq. 2.3. Since each subproblem ε_c is simpler than the initial problem by a factor of up to the number of joint assignments, parallel computation is expected to significantly reduce the running time. To further divide the problem in order to take advantage of more computers, more variables are used for conditioning. We incorporate these ideas in the algorithm Parallel-Constrained-Elimination, presented in Figure 3.5. The input to the algorithm is a Bayesian network N, an elimination order Π found by the Parallel-Find-Order procedure (Figure 3.4), and a threshold C. The threshold C defines the maximum complexity of each subtask, and it is proportional to the running time of a single subtask. The algorithm begins by selecting conditioning variables to be used for parallelization by iteratively adding variables to a set P.
In each iteration a new variable is selected from the set of all conditioning variables in the order Π, choosing the next unused conditioning variable according to Π. The order Π and the Bayesian network N are adjusted as follows: the selected variable is removed from Π,

and the Bayesian network N is modified by setting that variable to a specific value in all probability tables in which it appears. Thus, every iteration further simplifies the Bayesian network and reduces the cost of the elimination order. The process continues as long as the elimination cost of the modified elimination order exceeds the maximum allowed complexity threshold C. The number of subtasks L created by this step equals the number of joint assignments to all variables added to P. After L subtasks are created, they are executed in parallel using the serial Constrained-Elimination procedure (see Figure 3.3). The final result is obtained by summing the partial results of all subtasks. The choice of the number L of subtasks and their respective maximum size C is crucial for the efficiency of the parallelization. The inherent overheads of distributed environments, such as scheduling and network delays, often become a dominating factor inhibiting meaningful performance gains, suggesting that long-running subtasks should be preferred. On the other hand, performance degradation as a result of computer failures is lower for short subtasks, suggesting reducing the amount of computation per subtask. Furthermore, decreasing the amount of computation per subtask increases the number of subtasks generated for computing a given problem, improving load balancing and the utilization of available computers. Our algorithm controls the subtask size by specifying the maximum allowable complexity threshold C. Specifying lower values of C increases the number of subtasks L, decreasing the subtask complexity and consequently its running time. The value of C for a given problem is determined as follows. We initially set C so that a subtask's running time does not exceed the average time a task can execute without interruption on a computer in the Grid environment being used.
If this value of C yields a number of subtasks below the number of available computers, then C is iteratively reduced to allow division into more subtasks. The lower bound on C is set so that overheads due to scheduling and network delays constitute less than 1% of the subtask's running time. Using conditioning for parallelization results in repetitive evaluation of identical subexpressions by multiple computers, thus reducing the efficiency of the parallel algorithm. The amount of such redundant computation depends on the number of conditioning variables used for parallelization. We found that even for the most complex problems, which are

split into thousands of subtasks, the overhead does not exceed 10% because of the small number of conditioning variables required for parallelization. Such an overhead is by far outweighed by the performance gains due to splitting a task into a set of independent subtasks, completely avoiding communication and synchronization between them, and thereby allowing their execution in an opportunistic Grid environment.

3.2 Execution on grid resources

In this section we briefly describe our solution to the technical challenges of invoking parallel linkage analysis computations on grid resources. We provide a more general and comprehensive overview of multi-grid scheduling of divisible loads in the next chapters.

3.2.1 Preventing overload via staged complexity evaluation

Geneticists are not always aware of the computational load induced by a linkage analysis task. The addition of a single marker to the analysis of a large pedigree may increase the running time from only a few seconds to months (as demonstrated in the Results section), making the results of little value from a practical perspective, and rendering the system inaccessible to other users. To prevent unintentional overload caused by high-complexity tasks, the system rejects tasks exceeding the maximum complexity threshold. To reach a conclusion about task feasibility during the early stages of execution, the task complexity is assessed after every step of the Parallel-Find-Order algorithm, as explicated in Figure 3.4. The complexity thresholds C_1, C_2 and C_3 in the algorithm are set according to the amount of computational power available in the Grid environment being used. Overload can also be caused by many tasks of reasonable size submitted in a short period of time. To prevent this type of overload, the system load is evaluated before starting each new task and during its execution, rejecting a new task if the momentary system load is too high, without even evaluating the task's complexity.
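The two-sided admission policy just described can be sketched as follows; the thresholds and the load metric are illustrative stand-ins, not the values used by superlink-online.

```python
# Hedged sketch of the overload-prevention policy of Section 3.2.1.
# Both constants below are hypothetical stand-ins.
MAX_COMPLEXITY = 1e13   # reject individual tasks costlier than this
MAX_LOAD = 0.9          # reject any new task when the system is this busy

def admit(task_complexity_estimate, system_load):
    """Return True if a newly submitted task may enter the system."""
    if system_load > MAX_LOAD:
        # Overload from many reasonable tasks: reject immediately,
        # without even evaluating the new task's complexity.
        return False
    if task_complexity_estimate > MAX_COMPLEXITY:
        # A single task that is too expensive for the available power.
        return False
    return True

print(admit(1e9, 0.5))   # True: admitted
print(admit(1e14, 0.5))  # False: too complex
print(admit(1e9, 0.95))  # False: system overloaded
```

In the real system the complexity estimate comes from the staged Parallel-Find-Order evaluation, so an infeasible task is rejected after seconds or minutes rather than after consuming grid resources.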

3.2.2 Minimizing response time via multiple queues

The ability to efficiently handle concurrent computations of tasks of markedly different complexities is crucial for providing adequate performance when servicing multiple execution requests. A viable system must avoid a situation whereby a complex task prevents small tasks from prompt execution by occupying all resources. This situation is common for linkage problems, which can take months of execution time on many computers or seconds on a single one. To allow efficient handling of small tasks while simultaneously serving complex ones, the system classifies the tasks according to their complexity as determined by the Parallel-Find-Order procedure. Each range of complexities forms its own queue, which is handled independently of the others and uses a different set of computational resources, providing the shortest response time possible under a given system load. The higher the task complexity, the more computational power is employed: very short tasks are executed on a single dedicated computer without any scheduling delay; slightly more complex tasks are not parallelized and are invoked as high-priority tasks to avoid parallelization overhead; tasks of higher complexity are parallelized and invoked in a pool of a few dozen computers with low remote invocation overhead; finally, very complex tasks are transferred to a queue of their own, which, while exhibiting higher starting delays, makes it possible to use several hundred computers.
If for some reason a task executes in its queue longer than allowed by the queue specification, it is migrated to a queue for longer tasks, preserving all the available partial results. This feature of migrating partially done tasks among a set of queues without losing the computations done so far is a novel addition to the Grid computing paradigm, introduced due to the high computational demands of superlink-online.
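The queue hierarchy and migration policy can be sketched as follows; the queue names, complexity bounds and time limits are hypothetical, and a real migration would also carry over the partial results.

```python
# Hedged sketch of the multi-queue policy of Section 3.2.2.
# All names and numbers below are illustrative stand-ins.
QUEUES = [
    # (name, max_complexity, max_runtime_seconds)
    ("dedicated-serial", 1e6,  60),
    ("high-priority",    1e8,  600),
    ("parallel-pool",    1e11, 3600 * 6),
    ("heavy-tasks",      1e13, None),   # None: terminal queue, no limit
]

def route(complexity):
    """Pick the first queue whose complexity bound fits the task."""
    for i, (name, bound, _) in enumerate(QUEUES):
        if complexity <= bound:
            return i
    raise RuntimeError("task too complex, rejected")

def migrate(queue_idx, elapsed):
    """Move a task that overstayed its queue's time limit to the next
    queue; in the real system partial results are preserved."""
    limit = QUEUES[queue_idx][2]
    if limit is not None and elapsed > limit:
        return min(queue_idx + 1, len(QUEUES) - 1)
    return queue_idx

i = route(1e7)           # fits the second tier
print(QUEUES[i][0])      # high-priority
i = migrate(i, 1200)     # overran its 600 s limit
print(QUEUES[i][0])      # parallel-pool
```

The design choice this illustrates is that misestimated tasks are never killed and restarted from scratch: they simply percolate toward queues with more computing power and looser deadlines.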

3.2.3 Increasing throughput via multiple pools of computers

To accommodate high load we implemented mechanisms that allow expansion of the system beyond the boundaries of a single Condor pool of computers. Currently superlink-online spans five pools with a total of 2700 computers, including two pools at the Technion - Israel Institute of Technology in Haifa, Israel, and three large pools at the University of Wisconsin in Madison, USA. Tasks are first submitted to the Technion pools, and then migrate between pool sites according to the availability of computers and other load-balancing criteria. We implemented mechanisms to further extend the computational power of superlink-online by augmenting it with additional pools expected to be contributed by users and institutions worldwide, and used on a free-cycle basis without affecting the contributing owners' load.

3.2.4 Achieving high reliability via constant monitoring

Distributed environments are far more susceptible to failures than a single computer. It is therefore crucial to reliably detect such failures and notify a user about them, in order to prevent situations where a submitted task disappears while the user is waiting for results. This is especially important when the typical running time of a task can be several hours, making an absence of prompt response common and preventing timely resubmission of the failed task. To solve this, a set of processes constantly monitors the state of the system and notifies a user if her task fails. Recoverable failures are hidden by automatically resubmitting the task. Critical failures are reported to the system manager to allow correction.

3.3 Results

We demonstrate the system's capabilities by performing exact LOD-score computations. We use a Grid environment of about 2700 computers of various performance characteristics, though only computers having more than 500 MB of RAM and providing performance

higher than 300 MFLOPs (characteristics as reported by Condor [123], roughly corresponding to an Intel Pentium IV, 1.7 GHz) were used for the execution, reducing the overall number of machines actually used.

Figure 3.6 Pedigree reprinted from [72]

Experiment A (Testing correctness). We ran Superlink-online on all 146 datasets used in [54, 55]. For all these datasets, which differ in size, number of typed persons, and degree of consanguinity, Superlink-online computed correct LOD scores, validating our implementation.

Experiment B (Published disease dataset). The pedigree in Figure 3.6 was used for studying Cold Induced Sweating Syndrome in a Norwegian family [72]. The pedigree consists of 93 individuals, of whom two are affected and only four were typed. The original analysis was done using fastlink. A maximum LOD score of 1.75 was reported using markers D19S895, D19S566 and D19S603, with the analysis limited to three markers due to computational constraints. According to the authors, using more markers was particularly important in this study: in the absence of ancestral genotypes, the probability that a shared segment is inherited IBD from a common ancestor increases with the number of informative markers contained

in the segment; that is, a shared segment containing only three markers has a significant probability of being simply identical by state, whereas a segment containing a large number of shared markers is much more likely to be IBD from a common ancestor. Thus, the four-point LOD score computed on the basis of only three markers from within an interval containing 13 shared markers is an underestimate of the true LOD score. A study of another pedigree, as well as additional statistical methods, were employed to confirm the significance of the findings. Using Superlink-online we computed a six-point LOD score with the markers D19S895, M4A, D19S566, D19S443 and D19S603, yielding LOD=3.10 at marker D19S895, which would facilitate the linkage conclusion based on the study of this pedigree alone.

Experiment C (Number of computers for multipoint analysis). This experiment explicates the exponential growth of the required amount of computational resources when the number of markers used for the analysis increases. We performed our computations on the pedigree presented in Figure 3.7, consisting of 105 individuals. It comes from a study of brittle hair syndrome in a large Amish consanguineous kindred [114], and was originally analyzed using two-point analysis. We performed two-, three- and four-point analysis using the 10-allelic polymorphic markers D7S484, D7S2497 and D7S510, with respective distances of 3.3 and 0.5 cM. Table 3.1 summarizes the obtained LOD scores at marker D7S2497 and the corresponding amount of computational resources required to carry out the computations within the specified time. We were unable to perform five-point analysis due to the very high complexity of the computations. Consequently, the respective entries in the table were calculated according to the problem complexity estimation provided by Superlink-online when the task was processed.

Experiment D (Impact of parallelized ordering).
We performed two-point analysis of several pedigrees derived from a single large genealogy of thousands of individuals using pedhunter [16]. The pedigree was shrunk by a user of superlink-online until its complexity permitted performing exact computations.

#Markers   LOD score   Runtime      #Computers
1          …           … sec        …
2          …           … sec        …
3          …           … min        82
4          N/A         ~100 hours   ~20000

Table 3.1 Resource allocation for the analysis of the pedigree in Figure 3.7. Five-point analysis could not be completed, and its values are estimated.

Figure 3.7 Pedigree reprinted from [114]

This example demonstrates a sophisticated use of a large genealogy by employing pedhunter and Superlink-online in sequence. Such use of a genealogy reduces pedigree errors, which may be prevalent when large pedigrees are elicited [135]. The pedigree was initially reduced to contain 231 individuals, with 89% untyped and 10% affected. The analysis was performed using a 13-allelic locus. Increasing the pedigree size to 266 individuals, with 87% untyped and 8% affected, and performing two-point analysis using the same marker yielded LOD=3.65, which indicated that fine mapping is worth pursuing. This pedigree is depicted in Figure 3.8. Detailed analysis of the execution trace revealed that parallelizing the task of finding the order of computation played a significant role in the overall system performance. The initial complexity estimate for the pedigree with 266 individuals would require about 11 CPU years for the analysis to complete. The latest serial

version, superlink V1.7, reduced the complexity by two orders of magnitude, which would still require about 200 hours to compute on our system, given that a thousand PCs are available. Finally, the application of the parallelized ordering algorithm yielded an additional two-orders-of-magnitude reduction in the ordering complexity, which allowed the system to complete the computations in less than 7 hours.

Figure 3.8 Pedigree with 266 individuals (produced by OPediT)

Experiment E (Evaluation of performance). We measured the total run time of computing the LOD score at one disease-locus position for large pedigrees. The results reflect the time a sole user would wait for a task to complete, from submission via the web interface until an e-mail notification about task completion is sent back. Results are summarized in Table 3.2. We also compared the running time with that of superlink V1.7, invoked on an Intel Xeon 64-bit, 3.0 GHz, with 2 GB RAM. The entries with running time exceeding two days were obtained by measuring the portion of the problem completed within two days, as reported by superlink, and extrapolating the running time assuming similar progress. The time saving of the online system versus a single computer ranged from a factor of 10 to 700. In a large multiuser grid environment such as that used by Superlink-online, the number of computers employed in the computations of a given task may fluctuate during the execution between only a few and several hundred. Table 3.2 presents the average and the maximum

number of computers used during execution. We note that the performance can be improved significantly if the system is deployed in a dedicated environment.

Table 3.2 Summary of experiments on large pedigrees. The columns give, per pedigree, the number of persons, the number of markers, the percentage of typed individuals, the LOD score, the running time under superlink V1.7 and under Superlink-online, and the average and maximum number of computers used. For the most demanding runs, superlink V1.7 required an (extrapolated) ~300 hours, ~231 hours, ~138 days and ~160 days, whereas Superlink-online completed the same tasks in 7.1 hours, 3 hours, 6.2 hours and 8 hours, respectively; the shortest runs completed online in under 20 minutes. Running times for the pedigrees in Figures 3.7 and 3.8 are marked by (*) and (**), respectively. Some runs are performed on the same genealogy, increasing the number of individuals in the analysis. Running times of Superlink-online include network delays and resource failures (handled automatically by the system). The average column is computed from the number of computers sampled every 5 minutes during the run.

Chapter 4

Scheduling Mixed Workloads in Multi-grids1

1 Based on [115]

The vision of the grid as a virtual computer of unlimited capacity is yet to materialize. Rather, access is often granted to multiple uncoordinated grids that vary significantly in their size and performance characteristics. For example, researchers often have access to specialized computational clusters of a few dozen CPUs, in addition to having a few machines dedicated to their research. Organization-wide grids usually allow utilization of idle cycles of many desktop computers and offer a total of several thousand non-dedicated CPUs. National and international grids, which may include several supercomputing centers, typically scale up to tens of thousands of CPUs [6]. Finally, SETI@HOME-like communities [17] can potentially harvest cycles from hundreds of thousands of CPUs.

We aim at integrating all the grids accessible to a researcher into a single system that executes a stream of jobs with vastly different computational requirements. We consider divisible problems, which can be divided and sub-divided into any number of asynchronous sub-tasks of the desired granularity. A job split into such tasks is called a Bag of Tasks, or BOT. The system receives a stream of such jobs, where the total number of operations each job imposes, called the job complexity, is unknown, although its

distribution is strongly biased towards lower-complexity (or, shorter) jobs; see Figure 4.1 for the typical workload distribution as observed in the Superlink-online system. The jobs in the stream may have vastly different complexities, imposing a mixed workload on the executing environment. To achieve reasonable turnaround times, e.g., a few minutes for the shorter jobs and a few days for those of higher complexity, the appropriate parallelization level for a job is dictated by its complexity.

Figure 4.1 Expected distribution of job runtimes on a single CPU (bins: <3m, 3m-30m, 30m-3h, 3h-10h, 10h-30h, >30h), based on the statistics from the Superlink-online system

An important factor in the performance of a multi-grid system is the choice of a grid for job execution. While the number of CPUs in a grid is critical for obtaining high performance for higher-complexity jobs parallelized into massively-parallel BOTs, the response time for shorter jobs depends strongly on the execution overheads of the chosen grid (see Figure 2.2 for experimental evidence from several large-scale production grids). Such overheads may vary significantly across grids. While smaller grids are usually used exclusively by a small group of researchers and employ dedicated resources, larger grids are typically shared by hundreds or thousands of users, providing limited quality-of-service guarantees and being more vulnerable to attacks and failures. In fact, in the common case, overhead and availability seem to trade off with the grid size. We identify five

sources of overhead which commonly increase the cost of running a task on larger grids:

- Slow and unreliable WAN connections, due to geographic and organizational dispersion.
- Complicated resource management, due to the large number of resources to be managed. Sometimes (as in EGEE [6]) a task will pass through several resource brokers and queue managers until it is assigned physical resources.
- Enforcement of rigid policies, due to the large number of users (hundreds or thousands), making it hard for an individual user to improve her priority and gain access to more resources.
- Extensive security mechanisms, such as authorization, authentication, and data encryption.
- High volatility of resources, due to frequent failures and task evictions in favor of higher-priority users.

In the common case, these considerations result in reduced overheads for scheduling, invocation, and execution on smaller grids, allowing for more predictable execution and faster response. Larger grids, however, are usually tuned to provide high throughput, sometimes at the expense of higher turnaround times and lower responsiveness.

The problem of using multi-grid environments can be solved by unifying all available grids into a large, flat grid managed by one of the popular meta-schedulers [2, 8]. This solution, which is common in large-scale grid environments such as EGEE [6], typically uses a first-come first-served (FCFS) policy for a given user: jobs are opportunistically scheduled for execution on available resources that may reside at several different grids. However, this solution may result in all available resources being occupied by early-arriving demanding jobs, thus delaying the execution of late arrivers and degrading system response for short jobs.

In order to handle mixed workloads, many deployments use a natural extension of the flat approach: short jobs are prioritized by assignment to different FCFS queues [96]. If a

high-complexity job occupies all available resources, some of them will be relinquished in favor of short higher-priority jobs. However, this approach assumes a priori knowledge of the job's complexity. A further extension of the flat approach, similar to the multilevel feedback queue (MQ) [71], does not require knowledge of job complexity. It schedules every job in the highest-priority queue and moves it to a lower-priority queue if the job fails to complete within the queue time limit. In this way, a job will end up being assigned the correct priority according to its complexity. However, high-complexity jobs may still be assigned to low-overhead grids, leaving only high-overhead resources for short jobs, which may result in unacceptable turnaround times.

In this chapter we describe the combination of MQ with a new concept: the grid execution hierarchy. All available grids are sorted according to their size and overhead: upper levels of the hierarchy include smaller grids with faster response time, whereas lower levels consist of one or more large-scale grids with higher execution overhead. A mixed-workload job stream is scheduled on the hierarchy, so that each job is executed at the level that matches the job's complexity. As the complexity increases, so do the computational requirements and the execution overhead that can be tolerated. Consequently, the job will be placed at a lower level of the hierarchy. A job is first placed at the highest level of the hierarchy, as its complexity is not known upon arrival. If a job fails to complete within the time limit of that level, it is migrated to a lower level of the hierarchy where more resources are available. This process continues until the last hierarchy level is reached or the job terminates. One may wonder as to the reason for searching for the execution level starting from the top of the hierarchy.
Indeed, the proper execution level for a given job is easy to determine if the job complexity is simple to compute. However, for the important class of applications that motivated this research, even estimating task complexity is in itself a demanding computational problem. Applications in this class include constraint processing, Bayesian network inference, and other NP-hard problems, where task complexity estimation is NP-hard [41]. For such applications there exist heuristic algorithms (such as the one in Figure 3.4) which yield an upper bound on the job complexity, whose precision improves

the longer they execute. In our grid execution hierarchy framework, job complexity is reassessed at each level prior to execution. If the complexity is within the queue's complexity range, the job is executed at that level; otherwise it is moved to a lower level. The lower the level, the greater the amount of resources allocated for estimating the complexity more precisely, resulting in a better-matched grid size.

The characteristics of our approach are highlighted in Table 4.1, which gives the results (averaged over several attempts) of running two jobs submitted one after another: a long job, parallelized into a BOT with 3000 tasks and executed on a large Condor pool, and a short job of approximately thirty seconds on a single CPU. The jobs are scheduled using either MQ or FCFS. The grids are organized either as a large pool containing all resources (flat), or as a hierarchy (H). As expected, the turnaround time for the longer job is approximately the same for all four combinations of scheduling algorithm and system organization. In contrast, the turnaround time for the shorter job benefits from the higher priority it is assigned by MQ, and is further improved by the hierarchical organization, which ensures its assignment to highly responsive resources.

            FCFS    FCFS+H    MQ      MQ+H
Long job    6.2h    6.4h      6.1h    6.3h
Short job   4.7h    4.2h      3.5m    44s

Table 4.1 Results of runs in multiple scheduling scenarios

We evaluate the grid-hierarchy scheduling algorithm in the context of Superlink-online. In the experiments, Superlink-online utilizes about 2700 CPUs located in Condor pools at the Technion in Haifa and at the University of Wisconsin in Madison. Jobs, submitted via an Internet portal, go through complexity estimation, parallelization, and scheduling on the grids comprising the system.
The analysis of the traces of this production system shows that the proposed grid-hierarchy scheduling algorithm is able to distinguish jobs of different complexities and assign them to a grid of appropriate power and overhead. Consequently, even when the system is overloaded with jobs of high complexity, it is still able to support fast, almost

interactive turnaround times for short jobs, and reasonable completion times for medium-complexity jobs.

The rest of the chapter is organized as follows. We start with an overview of the related research. We describe the model of the execution environment and the expected workload. We then present the grid-hierarchy scheduling algorithm, followed by implementation details of the production system with which this algorithm is evaluated. Our evaluation is based on the statistics for about 2300 jobs submitted to the Superlink-online system between June and December 2005 by users worldwide.

4.1 Related work

Execution of BOTs in grid environments has been thoroughly studied by grid researchers. Running massively parallel jobs in heterogeneous large-scale environments has been the subject of many works (e.g., [24, 27, 28, 65, 74, 132]) that strive to minimize the turnaround time of a single BOT. In particular, [74] addressed the problem of resource management for short-lived BOTs on desktop grids, demonstrating the suitability of grid platforms for the execution of short-lived applications. Yet, this work does not deal with multiple grids. Meta-schedulers, such as [2, 8], strive to maximize the overall throughput and system utilization, as opposed to minimizing the job's turnaround time in our work. Sabin et al. [112] suggest an algorithm for scheduling a stream of parallel applications in a multi-grid, assuming the availability of reservation capabilities and the absence of resource failures, which makes it inapplicable in our setting. Marchal et al. [93] discuss steady-state scheduling of divisible workloads in wide-area grids. However, according to the authors, steady-state analysis ignores the initialization and cleanup phases, which are critical for short-lived jobs. The meta-scheduler component of the GrADS project [38, 127] supports scheduling of multiple jobs.
This work assumes the ability to directly invoke and preempt tasks on a given resource, and is not applicable to our case. Still, it highlights the importance of scheduling multiple jobs for improving the turnaround time of individual jobs in grids. Another component of this project is the rescheduler [128], which inspired our

implementation of load sharing between queues. The work in [49] demonstrates the benefits of load sharing in a grid composed of independently managed supercomputers, where entire parallel jobs are migrated between sites upon decisions made in a distributed fashion by each queue. This work also influenced our load sharing mechanisms between same-level grids. A multilevel feedback queue algorithm for time-shared single-CPU systems first appeared in [71]. The authors of [45, 64] analyze the scheduling of jobs with highly variable known processing times on a set of identical servers. They show, both theoretically and via simulation, that a scheduling policy which minimizes the waiting time in the system is one in which each server is assigned jobs of a specific size range, approximating SPTF scheduling. This principle is applied in many production supercomputing environments (e.g., [52, 66, 96]). Results published in [23] show that SPTF policies do not penalize long jobs when the job size distribution has a heavy-tail property and the largest 1% of the jobs comprises more than half the load, as in our system. Although these works assume the availability of job complexity information and homogeneous servers, they encouraged us to apply similar techniques in our system.

4.2 Model

In this section we describe the platform, application and submission models.

4.2.1 Platform model

Grids are managed by local workload and resource management software that cannot be changed or reconfigured in any way. Tasks are submitted into grids in a standard way, through a front-end submission node, and are subject to the local policies of the grid. No communication is assumed between the grid resources and the outside world except for submission nodes, due to firewalls separating uncoordinated grids. Faults, crashes, and other related events are handled by the local resource management software.

4.2.2 Application model

Jobs can be divided and sub-divided into any number of independent asynchronous tasks (divisible load jobs [29]). Jobs are parallelized and form a BOT. The BOTs are executed using the master-worker paradigm, where a single master dynamically schedules tasks to multiple workers (see, for example, [58]). Migration of master-worker executions across grids is supported via a checkpoint/restart operation of the master component alone. Namely, the master preserves the results of previously terminated tasks, as well as the state of its work queue, across invocations. Although complete support for checkpoint/restart is desirable, this partial functionality is usually sufficient and more practical.

4.2.3 Submission model

Jobs are submitted independently, forming an incoming stream. The job complexity distribution is heavily biased towards short jobs (e.g., [49, 52, 96]). Statistics gathered from our Superlink-online [118] production system for 2300 jobs show a similar bias (see Figure 4.1). Upon submission, task complexity is unknown. However, an upper bound on task complexity can be computed by a procedure whose accuracy improves together with the amount of resources allocated for its execution, as is the case in the context of genetic linkage analysis (e.g., [53]).

4.3 Grid-hierarchy scheduling algorithm

The algorithm has two complementary components: organization of multiple grids as a grid execution hierarchy, and procedures for scheduling jobs on this hierarchy.

4.3.1 Grid execution hierarchy

The purpose of the execution hierarchy is to classify available grids according to their performance characteristics, so that resources at each level of the hierarchy provide the

best performance for jobs of a specific complexity range. In other words, a job of any complexity has a level in the hierarchy which best matches its needs in terms of overhead and available resources. The upper levels of the hierarchy include smaller grids with faster response time, whereas lower levels consist of one or more large-scale grids with higher execution overhead. The number of levels in the hierarchy depends on the expected distribution of job complexities in the incoming job stream, as explained in Section 4.5.

Each level of the execution hierarchy is associated with a set of one or more queues. Each queue is connected to one or more grids at the corresponding level of the hierarchy, allowing submission of jobs into these grids. A job arriving at a given hierarchy level is enqueued into one of the queues. It can be either executed on the grids connected to that queue (after being split into a BOT for parallel execution), or migrated to another queue at the same level by employing simple load balancing techniques, described later. If a job does not match the current level of the execution hierarchy, as determined by the scheduling procedure presented in the next subsection, it is migrated to a queue at the next lower level of the hierarchy.

4.3.2 Scheduling jobs in a grid hierarchy

The goal of the scheduling algorithm is to find the proper execution hierarchy level for a job of a given complexity with minimum overhead. Ideally, if we knew the complexity of each job and the performance of each grid in the system, we could compute the execution time of a job on each grid, placing the job on the one that provides the shortest execution time. In practice, however, neither the job complexity nor the grid performance can be determined precisely. Thus, the algorithm attempts to schedule a job using approximate estimates of these parameters, dynamically adjusting the scheduling decisions if these estimates turn out to be incorrect.
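The ideal placement rule, if job complexity and grid performance were known exactly, amounts to minimizing startup overhead plus compute time over the available grids. A minimal sketch, with invented grid names and performance figures:

```python
# Idealized grid selection under full information: pick the grid that
# minimizes startup overhead plus compute time for a job of known
# complexity. The grids and all figures below are invented for
# illustration; the thesis's point is precisely that these quantities
# are NOT known exactly in practice.

GRIDS = {
    # name: (startup_overhead_sec, aggregate_ops_per_sec)
    "cluster":  (5.0,    5e10),   # small, dedicated, very responsive
    "campus":   (60.0,   1e12),   # organization-wide desktop grid
    "national": (1800.0, 5e13),   # huge, but with heavy overheads
}

def best_grid(job_complexity_ops):
    def exec_time(name):
        overhead, throughput = GRIDS[name]
        return overhead + job_complexity_ops / throughput
    return min(GRIDS, key=exec_time)
```

Short jobs land on the responsive cluster, huge jobs on the national grid, illustrating why overhead and size trade off across hierarchy levels.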
We describe the algorithm in steps, starting with the simple version, which is then enhanced.

Simple MQ with grid execution hierarchy

Each queue in the system is assigned a maximum time that a job may stay in the queue (queueing time) T_q, and a maximum time that a job may execute in the queue (execution time) T_e, with T_e ≤ T_q. The queue configured to serve the shortest jobs is connected to the highest level of the execution hierarchy, the queue for somewhat longer jobs is connected to the next level, and so on. A job is assumed to be short and is thus first submitted to the top-level queue. Indeed, while nothing is known about its complexity upon submission, recall that the complexity distribution in the job stream is biased toward shorter jobs. If any of the queue limits is violated, the job is preempted and migrated to the next lower queue (the one submitting jobs to the next hierarchy level). This algorithm ensures that any submitted job will eventually reach the hierarchy level that provides enough resources for longer jobs and fast response time for shorter jobs. In fact, this is the original MQ algorithm applied to a grid hierarchy.

Avoiding hierarchy level mismatch

The simplistic scheduling algorithm above fails to provide fast response to short jobs if a long job is submitted to the system prior to a short one. Recall that the original MQ is used in time-shared systems, and jobs within a queue are scheduled using preemptive round-robin, thus allowing fair sharing of the CPU time [71]. In our case, however, jobs within a queue are served in FCFS manner (though later jobs are allowed to execute if the job at the head of the queue does not occupy all available resources). Consequently, a long job executed in a queue for short jobs may make others wait until its own time limit is exhausted. Quick evaluation of the expected waiting and running times of a job in a given queue can prevent the job from being executed at the wrong hierarchy level. This is accomplished as follows.
2 For simplicity we do not add the queue index to the notations of queue parameters, although they can be set differently for each queue.

Each queue is associated with a maximum allowed single-job complexity C_e

and a maximum queue workload complexity C_q, where the queue workload complexity is defined as the sum of the complexities of the jobs in the queue. The queue complexity limits C_e and C_q are derived from the queue time limits T_e and T_q by optimistically assuming linear speedup, availability of all resources at all times, and resource performance equal to the average in the grid. The optimistic approach seems reasonable here, because executing a longer job in an upper-level grid is preferred over moving a shorter job to a lower-level grid, which could result in unacceptable overhead. The following naive relationship between a complexity limit C and a time limit T reflects these assumptions:

    C = φ(T) = T · N · P · β,    (4.1)

where N is the maximum number of resources that can be allocated for a given job, P is the average resource performance, and β is the efficiency coefficient of the application on a single CPU, defined as the portion of CPU-bound operations in the overall execution. By Eq. 4.1, C_q = φ(T_q) and C_e = φ(T_e).

The procedure ProcessNewJob in Figure 4.2 detects the jobs which exceed the queue complexity range and triggers their migration to a lower queue. The procedure comprises three steps. The first step is to detect queue overload, namely, whether the arriving job violates T_q. We denote by Q the estimate of the current queue workload complexity, as computed by the procedure EnforceQueueLimits, which is described later. We denote by C the current estimate of the job's complexity (C can be unknown for new jobs). If a violation is detected, the algorithm triggers an overload migration policy, which migrates the incoming job to the next queue or rejects it, without estimating its complexity. In the second step the job complexity is estimated. Recall that we assume the availability of a complexity estimation procedure which produces an upper bound on the job's complexity. The longer this procedure executes, the more accurate is the bound.
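Eq. 4.1 can be made concrete with a small numeric sketch; the values of N, P, β and the time limits below are invented for illustration, not measured system parameters:

```python
# Eq. 4.1: C = phi(T) = T * N * P * beta, where
#   N    - maximum number of resources allocatable to one job
#   P    - average resource performance (operations/second)
#   beta - fraction of CPU-bound operations in the execution
# All values below are illustrative assumptions.

def phi(T, N, P, beta):
    return T * N * P * beta

N, P, beta = 100, 1e9, 0.8      # assumed grid parameters
T_e, T_q = 3600.0, 7200.0       # assumed queue time limits (seconds)

C_e = phi(T_e, N, P, beta)      # single-job complexity limit
C_q = phi(T_q, N, P, beta)      # queue workload complexity limit
```

Since φ is linear in T, doubling a queue's time limit doubles its complexity budget, which is why the optimistic assumptions (linear speedup, full availability) make the limits easy to derive.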
Allocating a small portion α < 1 of T_e for complexity estimation often allows quick detection of a job that does not fit in the queue. However, the upper bound on the job's complexity

3 The complexity estimation can be quite computationally demanding for larger jobs, in which case it is executed using grid resources.

might be much larger than the actual value. Consequently, if the job is migrated directly to the hierarchy level that serves the complexity range in which this estimate falls, it may be moved to too low a level, thus degrading the turnaround time. Therefore, the job is moved only to the next level, where the complexity estimation algorithm is given more resources and time to execute, yielding a more precise estimate. The final step ensures that jobs whose complexities are above C_e, and which would consequently fail to terminate within T_e, are not invoked in the queue. The complexity mismatch migration policy is triggered for these jobs, migrating them to the next lower queue, unless the queue is at the lowest hierarchy level, in which case the job is rejected.

1. Overload detection:
       if Q > C_q then call MigrationPolicy(overload)
2. Job complexity estimation:
       if C is unknown or C > C_e then C ← {run complexity estimation for up to α·T_e}
3. Migration or invocation:
       if C > C_e then call MigrationPolicy(complexity mismatch)
       else leave for invocation in the current queue

Figure 4.2 Procedure ProcessNewJob

Enforcing queue limits

Using the procedure ProcessNewJob alone is insufficient to ensure that queue limits are not violated. The main reason for a job to stay in the queue longer than initially predicted is that the grid performance estimates may turn out to be too optimistic, as they do not account for possible fluctuations in the number of resources due to failures, changes in grid load, and other factors. Thus, the queue time limits must be enforced in a manner similar to that used in the original MQ algorithm, by monitoring the queue and migrating jobs which violate the queue limits. Furthermore, because the complexity of the remaining computations for a given job can be determined, jobs which are very likely to violate the

queue limits in the future can be detected. Early detection of such jobs increases the chance that later jobs will complete without migration to lower levels. This is accomplished by the EnforceQueueLimits procedure (Figure 4.3). Each job j stores its queueing time t_q^j (the time from arrival to the queue until termination) and its execution time t_e^j (the time from the moment the first task is started by the grid middleware until termination). The procedure EnforceQueueLimits consists of three steps. The first step determines the jobs which actually violate the queue limits, triggering the overdue migration policy, which preempts and migrates these jobs to the next lower queue, or terminates them if this is the lowest level. If a job is the only job in the queue, its migration (or termination) is delayed until another job arrives. The second step detects jobs which are likely to violate the queue limits: while such jobs have not yet exhausted their queue limits, they will likely be preempted in the future due to too many remaining computations (2.b).

For every job j, starting from the head of the queue:
  1. Detect actual violation of queue limits:
         if t_q^j > T_q or t_e^j > T_e then call MigrationPolicy(overdue)
  2. Detect future violation of queue limits:
     (a) Obtain the complexity C of the total remaining computations
     (b) Check if enough time remains to complete the job:
         if C > φ(T_e − t_e^j) or C > φ(T_q − t_q^j) then call MigrationPolicy(job potential overdue)
     (c) Check if enough time remains given the preceding jobs:
         if Q + C > C_q then call MigrationPolicy(queue potential overdue)
         else Q ← Q + C
  return Q

Figure 4.3 Procedure EnforceQueueLimits
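The two procedures of Figures 4.2 and 4.3 can be rendered in executable form roughly as follows. This is a simplified sketch: migration policies are reduced to returned strings, φ follows Eq. 4.1 with invented parameters, and the grid-executed complexity estimator is a stub.

```python
# Sketch of ProcessNewJob (Figure 4.2) and, for a single job, of
# EnforceQueueLimits (Figure 4.3). N, P and beta are invented values;
# the real system runs the estimator on grid resources.

def phi(T, N=100, P=1e9, beta=0.8):
    return T * N * P * beta              # Eq. 4.1

def process_new_job(job, Q, C_q, C_e, estimate):
    if Q > C_q:                          # 1. overload detection
        return "migrate:overload"
    if job.get("C") is None or job["C"] > C_e:
        job["C"] = estimate(job)         # 2. estimate for up to alpha*T_e
    if job["C"] > C_e:                   # 3. migration or invocation
        return "migrate:complexity-mismatch"
    return "invoke"

def enforce_for_job(tq, te, C, Q, T_q, T_e, C_q):
    if tq > T_q or te > T_e:             # 1. actual limit violation
        return "migrate:overdue", Q
    if C > phi(T_e - te) or C > phi(T_q - tq):
        return "migrate:job-potential-overdue", Q   # 2b. cannot finish in time
    if Q + C > C_q:
        return "migrate:queue-potential-overdue", Q # 2c. queue cannot absorb it
    return "keep", Q + C                 # job stays; update workload estimate
```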

The third step (2.c) prevents the accumulation of jobs in the queue if the jobs at the head of the queue are progressing too slowly and causing later jobs to exhaust their queue limit. Note that the procedure intentionally distinguishes between the job potential overdue and queue potential overdue migration policies (2.b, 2.c), to support different system behavior in each case. In the evaluation section we show how disabling these policies affects the queue performance.

Handling multiple grids at the same level of the execution hierarchy

The problem of scheduling in a configuration where multiple grids are placed at the same level of the execution hierarchy is equivalent to the well-studied problem of load sharing in multi-grids. It can be solved in many ways, including using the available meta-schedulers, such as [2], or flocking [124]. If no existing load sharing technologies can be deployed between the grids, we implement load sharing as follows. Our implementation is based on a push migration mechanism (as in [49]) between queues, where each queue is connected to a separate grid. Each queue periodically samples the availability of resources in all grids at its level of the execution hierarchy. This information, combined with the data on the total workload complexity in each queue, allows the expected completion time of jobs to be estimated. If the current queue is considered suboptimal, the job is migrated. Conflicts are resolved by reassessing the migration decisions at the moment the job is moved to the target queue. Several common heuristics are implemented to reduce sporadic migrations that may occur as a result of frequent fluctuations in grid resource availability [128]. Such heuristics include, among others, averaging momentary resource availability data with historical data, and preventing the migration of jobs with a small number of pending execution requests.
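The load-sharing decision described above might be sketched as follows. The queue interface, the exponential smoothing factor, and the migration-gain threshold are illustrative assumptions; the actual heuristics and constants in the system may differ.

```python
# Hypothetical sketch of push-based load sharing between queues at the same
# hierarchy level. The sampling interface and constants are assumptions.

def smooth(history, sample, beta=0.7):
    """Average momentary availability with historical data to damp fluctuations."""
    return beta * history + (1.0 - beta) * sample

def expected_completion(queue, job):
    """Expected completion time of `job` if placed in `queue`:
    (queued complexity + job complexity) / smoothed available throughput."""
    thr = max(queue.smoothed_throughput, 1e-9)
    return (queue.total_complexity + job.complexity) / thr

def pick_queue(job, current, peers, min_pending=5, gain=0.8):
    """Migrate only if a peer queue is clearly better, and never migrate jobs
    with few pending execution requests (anti-thrashing heuristics)."""
    if job.pending_requests < min_pending:
        return current
    best = min(peers + [current], key=lambda q: expected_completion(q, job))
    # require a clear improvement before migrating
    if best is not current and \
       expected_completion(best, job) < gain * expected_completion(current, job):
        return best
    return current
```

The `gain` factor implements the "clear improvement" guard: a marginally better peer does not trigger migration, which damps oscillations when resource availability fluctuates.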

4.4 The application

We have implemented the grid-hierarchy scheduling algorithm in the Superlink-online system. The jobs, submitted via the Internet by geneticists from national and international medical research institutions, are scheduled and parallelized by the system for execution in the distributed environment. In this section we extend the description of the parallel linkage analysis implementation on grids provided in Section 3.2.

Parallel jobs are executed in a distributed environment via Condor [124], a general-purpose distributed batch system capable of utilizing thousands of CPUs. Condor hides most of the complexities of task invocation in an opportunistic environment. In particular, it handles failures that occur because of changes in the system state. Such changes include resource failures, or a situation in which control of a resource needs to revert to its owner. Condor also allows resources to be selected according to the user requirements via a matching mechanism.

There are three stages in running master-worker applications in Condor: the parallelization of a job into a set of independent subtasks forming a BOT, the BOT's parallel execution via Condor, and the generation of final results upon their completion. In our implementation, this flow is managed by the Condor flow execution engine, called DAGman, which invokes the tasks of a BOT according to the execution dependencies between them, specified as a directed acyclic graph (DAG). The complete genetic linkage analysis job comprises two master-worker applications, namely, parallel ordering estimation and parallel variable elimination. To integrate these two applications into a single execution flow, we use an outer DAG composed of two internal DAGs, one for each parallel application. DAGman is capable of saving a snapshot of the flow state, and then restarting execution from this snapshot at a later time.
We use this feature for migration of a job to another queue as follows: the snapshot functionality is triggered, all currently executing subtasks in a job are preempted, the intermediate results are packed, and the job is transferred to another queue where it is restarted.
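For illustration, the outer DAG that chains the two internal DAGs might look roughly like the following DAGMan input file; the file names are hypothetical, and the exact syntax depends on the Condor version in use.

```text
# outer.dag -- hypothetical outer DAG combining the two parallel applications.
# Each SUBDAG is itself a DAG describing one master-worker application.
SUBDAG EXTERNAL ordering    ordering-estimation.dag
SUBDAG EXTERNAL elimination variable-elimination.dag
# Variable elimination starts only after ordering estimation completes.
PARENT ordering CHILD elimination
```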

Performance of a parallel application in an opportunistic environment is greatly influenced by the granularity of each subtask. To determine the granularity, we implement a simple heuristic, which restricts the maximum and the minimum duration of each subtask, as well as the total number of subtasks per single job, according to the actual characteristics of the underlying system, such as execution overhead and the number of resources. Finally, we employ the resource exclusion and resource prioritization techniques [74] for improving the turnaround time in an opportunistic environment: only resources with performance above some threshold are selected for execution, and among those selected, resources with higher performance are preferred.

4.5 Deployment of Superlink-online

The deployment of the Superlink-online portal used in the experiments in this chapter is presented in Figure 4.4. There are three levels of execution hierarchy: the first (short jobs) level is served by queue Q1, the second (average-sized jobs) level is served by two queues, Q2 and Q3, and the third (long jobs) level is served by queue Q4. Jobs exceeding the time limit of Q4 are rejected. We configured three levels of the execution hierarchy for the following reasons. About 60% of the jobs take a few minutes or less, and about 28% take less than three hours, as reflected by the histogram in Figure 4.1. This suggests that two separate levels, Level 1 and Level 2, should be allocated for these dominating classes, leaving Level 3 for the remaining longer jobs. Yet, the current system is designed to be configured with any number of levels to accommodate more grids as they become available.

Each queue resides on a separate machine, connected to one or several grids.
Utilization of multiple grids via the same submission machine is enabled by the Condor flocking mechanism, which automatically forwards job execution requests to the next grid in the list of available (flocked) grids if these jobs remain idle after resource allocation attempts in the preceding grids. Queue Q1 is connected to a dedicated dual-CPU server and invokes jobs directly

without parallelization. Queue Q2 resides on a submission machine connected to the flock of two Condor pools at the Technion. However, due to the small number of resources at Q2, we increased the throughput at Level 2 of the execution hierarchy by activating load sharing between Q2 and Q3, which is connected to the flock of three Condor pools at the University of Wisconsin in Madison. Jobs arrive at Q3 only from Q2. Queue Q4 is also connected to the same Condor pools in Madison as Q3, and may receive jobs from both Q2 and Q3.

In fact, Q3 exhibits rather high invocation latencies (as can be observed from the overhead analysis in Figure 4.7), and does not fit Level 2 of the execution hierarchy well. Alternatively, Q3 could have been set up as an additional level between Q2 and Q4, and the queue time limit of Q2 could have been adjusted to handle smaller jobs. However, because both queues can execute larger jobs efficiently, such partitioning would have resulted in unnecessary fragmentation of resources. Migration allows for a more flexible setup, which takes into account load, resource availability and overheads in both queues, and moves a whole job from Q2 to Q3 only if this yields better job performance. Typically, small jobs are not migrated. Larger jobs, however, usually are, as they benefit from execution in a larger grid where they are allocated more execution resources (see Table 4.3). This configuration results in better performance for smaller jobs than does flocking between these two grids, as it ensures their execution on the low-overhead grid. To ensure that larger jobs of Q4 do not delay smaller jobs of Q3, subtasks of jobs in Q3 are assigned higher priority and may preempt subtasks from Q4. Starvation is avoided via Condor's internal dynamic priority mechanisms [124].

The queue constraints T_q and T_e are configured as detailed in Table 4.2. The intuition behind the current configuration is as follows.
The average time for allocation of a single CPU in the grid attached to Q2 is about 20 seconds. Thus, jobs arriving at this queue should be about ten times longer in order for the overhead not to dominate their performance. Consequently, jobs below 200 seconds should be served in the previous queue, resulting in T_e = 3 minutes for Q1. Values of T_q are set to prevent jobs from accumulating in the queues, as the users prefer jobs to be rejected rather than delayed. We restrict the allowed queue length to up to two jobs of maximum duration for Q1, up to two jobs in Q2,

and only one job in each of Q3 and Q4.

Figure 4.4 Superlink-online deployment

The maximum number of available CPUs for a single user is smaller than the total number of resources in the corresponding grids. Out of 200 CPUs in the Technion Condor pools, only 100 satisfy the minimum memory requirements of the application. For the Madison Condor pool, the limit of 500 jobs stems from the recommended maximum number of running jobs concurrently handled by a single submission machine. More jobs cause severe overload of the submission machine and are thus avoided.

4.6 Results

We analyzed the traces of 2300 jobs, submitted to the Superlink-online system via the Internet by users worldwide during the period between the 1st of June and the 31st of December. During this time, the system utilized about 460,000 CPU hours (52.5 CPU years) over all Condor pools connected to it (according to the Condor accounting statistics). This

Table 4.2 Parameters of queues Q1-Q4: T_q, the queue waiting time limit (min); T_e, the queue execution time limit (min); N, the maximum number of CPUs available to a single user; P, the average performance of computers in the grid (KFlops)

time reflects the time that would have been spent had all jobs been executed on a single CPU. About 70% of the time was utilized by 1971 successfully completed jobs. Another 3% was wasted because of system failures and user-initiated job removals. The remaining 27% of the time was spent executing jobs which failed to complete within the queue time limit of Q4, and were forcefully terminated. However, this time should not be considered lost, since users were able to use the partial results. Still, for clarity, we do not include these jobs in the analysis.

Utilization of the execution hierarchy

We compared the total CPU time required to compute the jobs at each level of the execution hierarchy relative to the total system CPU consumption of all levels together. As expected, the system spent most of its time handling the jobs at Level 3, comprising 82% of the total running time of the system (see Figure 4.5). The jobs at Level 2 consumed only 17.7% of the total system bandwidth, and only 0.3% of the time was consumed by the jobs at Level 1. If we consider the total number of jobs served by each level, the picture is reversed: the first two levels served significantly more jobs than the lowest level. This result shows that the system was able to provide short response times for the jobs served at the upper levels of the hierarchy. This conclusion is further supported by the graph in Figure 4.6, which depicts the average accumulated time of jobs in each queue, computed from the time a job is submitted to the system until it terminates. This time includes accumulated overheads, which are

Figure 4.5 Portion of jobs handled by each level of the hierarchy (first column) versus portion of the overall system CPU time utilized by each level (second column)

computed by excluding the time of actual computations from the total time. As previously, the graph shows only the jobs which completed successfully. Observe that very short jobs, which require less than three minutes of CPU time and are served by Q1, stay in the system only 72 seconds on average, regardless of the load in the other queues. This is an important property of the grid-hierarchy scheduling algorithm. The graph also shows the average accumulated overhead for jobs in each queue, which is the time a job spent on any activity other than the actual computations.

The form of the graph requires explanation. Assuming a uniform distribution of job runtimes and the availability of an appropriate grid for each job, the job accumulated time is expected to increase linearly towards the lower levels of the hierarchy. In practice, however, these assumptions do not hold. There are exponentially more short jobs requiring up to 3 minutes on a single CPU (see Figure 4.8). This induces a high load on Q1, forcing short jobs to migrate to Q2 and thus reducing the average accumulated time of jobs in Q2. This time is further reduced by the load sharing between Q2 and Q3, which causes larger jobs to migrate from Q2 to Q3. Thus, shorter jobs are served by Q2, while longer ones are executed in Q3, resulting in the observed difference between the accumulated times in

these queues.

Figure 4.6 Average accumulated time (from arrival to termination) of jobs in each queue, broken down into accumulated runtime and accumulated overhead (including Condor evictions and Condor queueing)

To explain the observed steep increase in the accumulated time in Q4, we examined the distribution of running times in this queue. We found that shorter jobs (while still exceeding Q3's allowed job complexity limit) were delayed by longer jobs that preceded them. Indeed, over 70% of the overhead in that queue is due to the time the jobs were delayed by other jobs executing in that queue. This delay is a result of disabling the potential overdue migration policy in Q4, which is enabled in all other queues. Jobs in Q4 are allowed to run until they actually violate the queue time limits in order to allow the generation of partial results, which are valuable in genetic linkage analysis applications. Thus, the queuing times of shorter jobs arriving at Q4 increase, resulting in longer jobs dominating the accumulated time. The availability of additional grids for the execution of higher-complexity jobs would allow the queueing and turnaround times to be reduced.

Overhead distribution in queues

Figure 4.7 provides a more detailed view of the types of overhead in each queue. This includes the invocation and control overheads incurred by DAGman as well as the time

spent on complexity estimation, migration, and waiting for Condor to allocate resources. This last parameter is computed as the time from the moment the first job is submitted until Condor starts executing it.

Figure 4.7 Overhead distribution in different queues (Condor evictions, Condor queueing, local queueing, complexity estimation, migration, and Condor DAGman), as a percentage of the total time in the system

The major overhead of the jobs in Q1 is due to complexity estimation (11 seconds). In the initial implementation, a job was executed without complexity estimation, but preempted if it turned out to be a long job. However, since the jobs in our system are often submitted in bursts, this resulted in a higher load on Q1 and delays for short jobs. The job flow is managed via DAGman, whose implementation adds about 4 seconds of its own, because DAGman sometimes requires a few seconds to detect the termination of jobs in the DAG. Note that Q1 invokes jobs locally, and thus does not suffer any Condor-related overheads.
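The overhead accounting described above can be sketched as follows; the per-job event-record layout is an assumption made for illustration, not the actual log format of the system.

```python
# Hypothetical sketch of the per-job overhead accounting behind Figure 4.7.
# The job/event attribute names are assumptions.

def overhead_breakdown(job):
    """Split a job's accumulated time into computation and overhead classes."""
    total = job.end - job.arrival
    compute = sum(t.cpu_time for t in job.tasks)          # actual computations
    # Condor queueing: from first submission until Condor starts the first task
    condor_queueing = job.first_task_start - job.first_task_submit
    breakdown = {
        "complexity_estimation": job.estimation_time,
        "condor_queueing": condor_queueing,
        "migration": job.migration_time,
        "other": 0.0,
    }
    accounted = compute + sum(breakdown.values())
    breakdown["other"] = max(0.0, total - accounted)      # e.g. DAGman latency
    return total, compute, breakdown
```

Everything not attributable to a measured class falls into the residual bucket, which is how latencies such as DAGman's delayed termination detection would surface in such an accounting.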

Queues Q2 and Q3 serve jobs of about the same size and thus should exhibit comparable overheads. Indeed, the overheads due to DAGman and complexity estimation are almost equal, with the latter being slightly lower in Q2 due to the availability of faster CPUs (see Table 4.2). However, Q3 shows significantly higher overheads than Q2 due to long Condor queueing times, reaching several minutes on average, versus 20 seconds on average for Q2.

Q4 shows the least overhead (3.5%) in terms of complexity estimation and waiting time in Condor queues, relative to the average job execution time in that queue. However, as was previously explained, insufficient resources caused some of the jobs to be delayed by long-running jobs that preceded them in the queue. This resulted in long delays due to local queueing. An important factor in the overheads of Q3 and Q4 is the volatility of grid resources, namely, the loss of computations due to evictions. We note that this overhead is significantly higher for these queues, connected to the Condor pool at UW Madison, than for Q2, whose jobs are submitted to the Technion's small grid.

The graph in Figure 4.7 shows that in all queues, the overhead of the grid-hierarchy scheduling algorithm and its implementation does not exceed 20% of the total job time in the system even for very short jobs, and is significantly lower for longer ones, which is a reasonable trade-off for obtaining short response times.

Distribution of jobs in levels

Optimally, a job should be directly invoked at the best level in the hierarchy, acquiring the maximum possible resources at that level (which exactly matches our optimistic assumption in Eq. 4.1). Thus, we define the expected job duration as the time it would have taken the job to complete had it immediately been given the resources of the best-matching level of the hierarchy upon entering the system.
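One plausible reading of this optimistic assumption, namely that a job immediately receives all N CPUs of average performance P available at a level, gives the following sketch. The level parameters here are placeholders for illustration, not the actual values of Table 4.2.

```python
# Illustrative sketch: expected job duration at a hierarchy level under the
# optimistic assumption of immediate allocation of all N CPUs of average
# performance P. Parameter values below are placeholders.

def expected_duration(complexity, n_cpus, perf):
    """Time to complete `complexity` work units on n_cpus CPUs of speed perf."""
    return complexity / (n_cpus * perf)

def best_matching_level(complexity, levels):
    """First level (top to bottom) whose execution limit T_e accommodates
    the job's expected duration; None if the job must be rejected."""
    for level in levels:                    # ordered from highest to lowest
        if expected_duration(complexity, level["N"], level["P"]) <= level["T_e"]:
            return level["name"]
    return None
```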
Figure 4.8 shows the distribution of the expected job durations, derived from their actual exact complexity, and the execution level in the hierarchy where these jobs were executed in the real system. Ultimately, each class of jobs is supposed to be handled by a level of the hierarchy matching its real complexity. However,

about 11% of the jobs that are expected to be processed by Level 1 of the hierarchy are migrated and executed at Level 2.

Figure 4.8 Actual distribution of jobs among the different levels of the hierarchy versus their expected duration on a single CPU (bins: <3m, 3m-30m, 30m-3h, 3h-10h, 10h-30h, >30h). Ideally, each column would have a single level color.

Further investigation revealed two reasons: 1) high load in queue Q1 of Level 1, resulting in automatic offloading of jobs to Q2; 2) too loose an upper bound on the complexity produced at the first stage. As opposed to Level 1, Level 2 performs well for all jobs with expected duration below its time limit, with only a few longer jobs moved to Level 3. We found that these jobs were initially invoked at Level 2, but were later moved due to momentary Condor failures. These failures resulted in low availability of resources and caused the jobs to be preempted in accordance with the potential overdue migration policy.

Level of parallelism and volatility

Running parallel jobs in a grid environment is complicated by the inherent volatility of the resources. A job can be evicted at any point of execution and then restarted on another

resource, either from the beginning or from the last checkpoint, if such functionality is supported. Jobs in our application do not support checkpointing, and were thus made short to minimize the overhead due to evictions.

Table 4.3 Resource and job properties in different queues (average values for Q2, Q3 and Q4): volatility (% of runtime); volatility (% of submitted subtasks); number of allocated resources per job (absolute, and as % of requests); number of subtasks per job (in the BOT); job duration (min)

Table 4.3 shows the effect of resource volatility on the performance of our jobs in each queue. We considered only parallel jobs, namely, those consisting of multiple subtasks, and the values are averaged over all jobs in a given queue. We found that for a job running in the Technion Condor pool, about 1% of the job's running time is lost due to evictions (row 1), and 1% of its subtasks are evicted (row 2). For similar jobs in Q3 these values are 4.5% and 7%, respectively. Since the total average runtime of jobs as well as subtask durations are similar in both queues, resources in the Madison Condor pool seem to exhibit a higher degree of volatility, confirming our assumption of the size-volatility tradeoff. This is further confirmed by jobs in Q4, where about 14% of a job's subtasks are usually evicted. This value reflects an average of the fluctuations of resource volatility over the longer periods of execution of these jobs.

Simple calculations show that, given a dedicated cluster, a job at Q2 could have been completed in about 20 minutes, while the actual turnaround in the opportunistic environment is about 4 times higher. This is in part due to the known phenomenon whereby, when short jobs are executed in opportunistic environments, the last few subtasks of a job may dominate its running time because of resource volatility and heterogeneity [74]. Another important grid property is the amount of resources simultaneously allocated to the same user (row 3 in Table 4.3). We measured this value for each parallel job by
We measured this value for each parallel job by 60

counting the number of simultaneously executing subtasks between the invocations of the first and the last submitted subtasks, sampling at every invocation or termination event and averaging over all available samples. The intuition is to measure the resource allocation only while a job still has pending execution requests. We note that this value depends on the total number of execution requests of a given job. Thus we normalize it by the total number of subtasks per job, and average it over all parallel jobs in the queue (row 4 in Table 4.3). These values show that, on average, jobs which are scheduled for execution at UW Madison obtain more resources than those scheduled in the Technion Condor pools, justifying the structure of our grid hierarchy.

4.7 Discussion

In this chapter we presented a method for organizing grids and an algorithm for scheduling mixed workloads in multi-grid environments. We implemented the algorithm for the Superlink-online production system and demonstrated that it yields short response times for short jobs even when the system is already loaded with long ones.

There are several limitations to the current scheme. While the static approach we use for building the execution hierarchy yields reasonable performance, the volatile nature and properties of grid systems call for dynamic structures. This requires on-the-fly adaptation of the hierarchy to the changing properties of the grids, and a cost model that takes into account the locality of applications and execution platforms. Furthermore, parallel migration may impose prohibitively high performance costs, and should be avoided. The current framework results in a fragmentation of the grid resources, whereby parallel jobs can be executed only on resources belonging to the same grid. This problem becomes more severe as the number of different grids grows, limiting the system's scalability significantly.
Another constraint imposed by the current implementation is its inability to migrate jobs upward in the hierarchy, from the lower hierarchy levels to the higher ones. Finally, the hard-coded scheduling policy, which always prioritizes shorter jobs and delays longer ones, may be inappropriate for various user scenarios, requiring a dynamically

adjustable scheduling policy to be in place. These limitations motivated the development of a more generalized approach to the scheduling and execution of BOTs in multi-grids, described in the next chapter.

Chapter 5

Policy-based Scheduling of BOTs on Multiple Grids¹

The grid execution hierarchy approach described in the previous chapter allowed for efficient execution of short BOTs even in the presence of high system load and high resource volatility. Yet, as mentioned in the discussion at the end of that chapter, it suffers from a number of shortcomings, such as grid fragmentation and underutilization. Furthermore, the grid hierarchy assumed a correlation between resource reliability and grid size, which proved correct for grids differing in size by orders of magnitude, but did not allow for the classification of resources from similar-sized grids, or of resources within the same grid. Finally, the suggested mechanism was not flexible enough to accommodate different target performance functions for different BOTs.

This chapter describes the algorithms and mechanisms for scheduling Bags of Tasks (BOTs) on multiple uncoordinated grids; they overcome the above shortcomings, significantly extending, generalizing and enhancing the previously described approaches.

BOTs have traditionally been the most common type of parallel application invoked in grids. Their pleasantly parallel nature enables large-scale invocation on the grids, despite slower networks, limited connectivity between geographically distributed resources, and job failures. Grid workflow engines have further strengthened the position of BOTs as the

¹ Based on the paper [117]

dominant type of grid workload, because they enable compound parallel applications with multiple interdependent BOTs [3, 43, 134]. Large grids, such as OSG [9] and EGEE [6], as well as community grids, have been very efficient in running throughput-oriented BOTs with thousands or millions of jobs. However, the invocation of moderate-sized, performance-oriented BOTs in large non-dedicated grids often results in higher turnaround times than executing them on a small dedicated cluster [75]. This is because shorter BOTs are more sensitive to the turnaround time of a single job. Their performance is dominated by the slowest job: even a single failure increases the turnaround time of the whole BOT. In contrast, larger BOTs have enough jobs to keep all the available CPUs busy, so that the maximum available throughput can be achieved. Yet, the transition of a BOT from the high-throughput phase to the tail phase, characterized by the decrease in the number of incomplete jobs toward the end of the run, makes even throughput-oriented BOTs less immune to failures and delays.

This throughput-optimized modus operandi of grid environments often makes them less attractive to scientists, who are tempted to build their own dedicated clusters optimized for shorter BOTs instead of using the grids. However, the required computational demand typically outgrows the limited local resources, in particular if the scientific results prove successful. Thus, the same researchers will eventually need to access additional clusters, cloud computing infrastructures, institutional and international grids, and may even end up establishing a community grid of their own. Unfortunately, multiple separately managed grids without a common scheduling mechanism are an impediment to high performance for both shorter and larger BOTs.
Static partitioning of a BOT among the grids does not account for sporadic changes in the resource availability, and reduces the number of jobs per BOT in each grid, decreasing overall efficiency. Thus, the segmentation of the resources requires dynamic job distribution and load balancing. Further complications arise if the workload comprises a mixture of large and small BOTs, as often happens in grid workflows. For example, better turnaround times will be

obtained for smaller BOTs if they are scheduled on more reliable resources [75]. Routing BOTs to different grids according to the estimated BOT resource demand, as in the grid execution hierarchy described in the previous chapter, results in rapid turnaround for smaller BOTs, but only under moderate system loads. Otherwise, the available resources become segmented and performance is reduced for large and small BOTs alike. Also, any static policy that does not capture changes in the system state and in BOT execution dynamics will be suboptimal. A BOT considered throughput-oriented at one point may become performance-oriented and vice versa, due to changes in the computational demand of larger BOTs in the tail phase, and to fluctuations in grid resource availability. Lastly, budget constraints, emerging in pay-as-you-use cloud computing environments, may require a special resource allocation policy for some BOTs to reduce costs.

Another aspect of multi-BOT scheduling is prioritization. For example, a shorter BOT will experience a significant slowdown in a FIFO queue if submitted after a long one, as was also shown in Table 4.1. Consider also a scenario where two BOTs are invoked by two different users, each contributing their own cluster to the system. Clearly, each BOT should be prioritized on the cluster belonging to its owner, with lower priority on the foreign cluster. A simple priority queue, which would solve the problem of slowdown in the first scenario, will not suffice here.

Contribution. We present a generic scalable mechanism for efficient concurrent execution of multiple arbitrary-sized BOTs in compound multi-grid environments. To the best of our knowledge, this is the first solution which combines several diverse grids into a single monolithic platform supporting flexible runtime policies for large-scale execution of multiple BOTs.
First, we unify the grids by establishing an overlay of execution clients, a technique termed overlay computing [4, 15, 110, 130]. While widely used for eliminating long queuing delays and aggregating multiple grids, the existing technologies fall short in grids with strict firewall policies and private networks. Our implementation overcomes this limitation while requiring no prior coordination with grid administrators and no deployment of additional software in the grids. Furthermore, community grid resources are integrated with all the others, forming a unified work-dispatch framework.
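Conceptually, an overlay execution client is a small pilot program submitted to a grid as an ordinary job; once started, it pulls work from the central work-dispatch server using outbound requests only, which is what lets it operate behind grid firewalls. A minimal sketch under an assumed dispatch-server interface (real systems use, for example, BOINC's RPC protocol):

```python
# Minimal sketch of an overlay execution client (pilot job). The dispatch
# server interface is a stand-in, not an actual protocol.
import time

def run_overlay_client(server, lifetime, idle_backoff=30,
                       clock=time.time, sleep=time.sleep):
    """Pull-based work loop: fetch a job, execute it, report the result.
    The client exits when its grid-allocated lifetime expires."""
    deadline = clock() + lifetime
    while clock() < deadline:
        job = server.fetch_job()            # outbound request to dispatch server
        if job is None:
            sleep(idle_backoff)             # no work available; back off and retry
            continue
        result = job.run()                  # execute the application task
        server.report(job, result)          # upload result, then ask for more work
```

The injectable `clock` and `sleep` parameters are a testing convenience; a deployed client would simply use real time.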

Second, we apply several known techniques for achieving rapid turnaround of BOTs, including resource matching, job replication and dynamic bundling [119]. In particular, replication (the speculative execution of multiple copies of the same job) was shown to decrease BOT turnaround time in failure-prone environments [19, 20, 75, 76, 133]. Many of these works devise specific replication algorithms applicable in a certain setup. Our contribution is in the explicit separation of the mechanisms that implement these techniques from the policies that determine when and how the mechanisms are employed by the work-dispatch framework. The BOT owner may assign arbitrary runtime policies to each BOT. These policies can depend on the system state, the BOT properties and state, the state of the different job replicas in the BOT, as well as various statistical properties of the resources. The policies can be adjusted during execution to accommodate unexpected changes in user requirements or system state. Third, we enable resource-dependent prioritization policies to be specified for concurrently executing BOTs, so that multi-BOT scheduling algorithms can be used [68].

The GridBot system, which implements these policy-driven mechanisms, consists of a work-dispatch server and grid execution clients submitted to the grids by the overlay constructor. Our implementation is based on the BOINC server [18], developed as part of the middleware for building community grids. Beyond its extensibility and proven scalability, BOINC is the de-facto standard middleware for building such grids. By integrating our mechanisms into BOINC, we make GridBot compatible with the standard BOINC execution clients, making it possible, in principle, to use the over three million computers worldwide [1] on which these clients are installed. Combined with the other clients dynamically deployed in grids to form the overlay, GridBot creates a unified scheduling framework for standard and community grids.
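To make the mechanism/policy separation concrete, a per-BOT replication policy might be expressed as a predicate evaluated by the work-dispatch framework. The following is an illustrative sketch only: the state fields, thresholds, and predicate form are assumptions, not GridBot's actual policy language.

```python
# Illustrative per-BOT replication policy in the spirit of the
# mechanism/policy separation. Fields and thresholds are assumptions.

def should_replicate(bot, job, now, max_replicas=3, tail_fraction=0.05):
    """Replicate a job only in the BOT's tail phase, when few jobs remain
    and the job looks delayed relative to the BOT's average turnaround."""
    in_tail = bot.remaining_jobs <= tail_fraction * bot.total_jobs
    delayed = (now - job.dispatch_time) > 2.0 * bot.avg_turnaround
    return in_tail and delayed and job.replicas < max_replicas
```

Because the predicate sees the BOT state, the same BOT can behave as throughput-oriented during its main phase (no replication) and switch to performance-oriented behavior in the tail, without any change to the dispatch mechanism itself.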
To accommodate a large number of resources, we applied a number of optimizations for greatly increased scalability. We envision the GridBot system being employed by workflow engines such as Pegasus [43] and DAGman [3]. However, our original motivation was to supply the growing computing demands of the Superlink project. GridBot serves as a computing platform for running BOTs from Superlink-online [10], and is deployed in a pre-production setting. It

currently utilizes resources in the OSG, EGEE, the UW Madison Condor Pool, the Technion campus grid, a number of local clusters, and the Superlink@Technion community grid [11]. During three months, about 25,000 computers worldwide have participated in the computations, with 4,000 from EGEE, 1,200 from Madison, 3,500 from OSG, and the rest from about 5,000 volunteers from 115 countries. GridBot's effective throughput roughly equalled that of a dedicated cluster of 8,000 cores, with a theoretical peak throughput of 12 TFLOPs. Over 9 million jobs from about 500 real BOTs were executed, ranging from hundreds to millions of jobs per BOT, with each job requiring minutes to hours of CPU time. The total effective CPU time consumed in three months equals 250 CPU years. (Due to the on-demand nature of the workload originating from the Superlink-online Web portal, there were also periods of idle time.) The current GridBot statistics are gathered via an extensive runtime monitoring infrastructure and are available online [7]. In our experiments we demonstrate the flexibility, efficiency and scalability of GridBot for running various real-life BOTs. We also evaluate common replication and scheduling policies on a scale which, to the best of our knowledge, has never been shown before.

5.1 Related work

From the onset of cluster and grid computing research, a number of systems have been developed for the execution of BOT-type workloads using application-level scheduling (APST [32], Nimrod-G [15], and Condor Master-Worker [59], among others). Recent works have reemphasized the importance of overlay computing concepts (also termed multilevel scheduling) [4, 69, 110, 122, 130]. However, the existing systems do not provide BOT-specific execution mechanisms, leaving their implementation to the application. Nor can they utilize community grids or grids with strict firewall policies.
Our approach is to enable the execution of BOTs in compound non-dedicated environments by making the BOT a first-class citizen at the work-dispatch level, thus removing the burden from the application while allowing application-specific policies to be specified. Condor glidein technology [4, 125] is the closest to GridBot in terms of its overlay

computing and policy specification mechanisms [111]. However, it currently lacks BOT-specific functionality in general and replication in particular. Furthermore, private networks and strict firewall policies pose significant obstacles to the use of glideins in standard and community grids. Yet, the success of Condor encouraged us to use classads as the policy language. Falkon [110] achieves remarkable scalability and work-dispatch efficiency, but to the best of our knowledge it does not allow any parametrized policies to be specified. Workflow engines, such as Swift [134], DAGman [3], Pegasus [43] and Nimrod-K [14], provide a convenient way to compose multiple BOTs or jobs into a composite parallel application. All of them allow execution over regular batch or overlay-computing systems, but do not expose the replication policy to the user. The idea of replicating jobs in failure-prone environments was investigated from both theoretical [76] and practical perspectives [12, 19, 35, 73, 133]. These papers propose algorithms for replication and resource selection to reduce BOT turnaround time. These works motivated the design of our replication and scheduling mechanisms and served as examples of policies to be enabled by GridBot. Bundling of multiple jobs was suggested in the context of Pegasus [119] and Falkon [110]. Scheduling heuristics for multi-BOT scheduling were investigated by Iosup et al. [68] and Anglano et al. [20], and served as a motivating example for our ranking policy mechanism. Integration of different types of grids, including community grids, was also discussed by Cappello et al. [31], and further developed by the EDGeS project [5]. These works mostly focus on the system infrastructure, as opposed to the user-centric mechanisms of GridBot.

5.2 Terminology

The term bag-of-tasks (BOT) refers to a parallel computation comprising independent jobs. Successful termination of all jobs is necessary for termination of the BOT.
Overlay computing is a technique for mitigating the long waiting times in grid queues

whereby special execution clients are submitted to the grids instead of real jobs. When invoked on a grid resource, such a client fetches jobs directly from the user-supplied work-dispatch server, thus bypassing the grid queues.

5.3 GridBot architecture

The GridBot architecture is depicted in Figure 5.1. It is logically divided into the work-dispatch logic and the grid overlay constructor. Execution clients in the overlay follow the pull model, whereby they initiate the connection to the server to fetch new jobs, but the server is not allowed to initiate a connection to the clients. We target the case where traffic initiated from the public network to the clients is entirely disallowed. However, we assume that the clients can initiate a connection to a single port of at least one host in the public network. This assumption holds in the majority of grid environments with which we have had a chance to work. The overlay constructor is responsible for submitting new execution clients into the grids whenever there are jobs in the job queue. It determines the number of clients to be submitted to each grid and issues the resource requests to one or more submitters. Note, however, that there are also static clients (as opposed to those dynamically deployed via the overlay) which originate in a community grid. They are entirely under the control of the resource owners and contact the server at their will.

5.4 Work-dispatch logic

The work-dispatch logic comprises two interdependent components: the generic mechanisms for matching, prioritization, bundling, deadlines and replication; and the policy evaluation module for enforcing the user-specified policies controlling these mechanisms. As in Condor, we use classads for policy specification. Classified advertisements (classads) [111] is a generic language for expressing and evaluating properties. A classad is a schema-less list of name-value attributes. It can be logically divided into a set of

descriptive attributes having constant values, as in XML, and functional attributes specifying an arbitrary expression for computing their actual value at runtime. These expressions may include constants, references to other attributes, calls to numerous built-in functions, or nested classads. A classad interpreter enables efficient dynamic evaluation of the functional attributes at runtime, which, coupled with the schema-less nature of the language, opens unlimited possibilities for policy specification.

Figure 5.1 GridBot high-level architecture

5.4.1 Classads in GridBot

Every system entity is described as a classad. Here we detail only the most important attributes in each classad; in practice there are more of them, and new ones can be added. The host classad contains static and dynamic properties, some of which are reported by the host, such as the number and type of CPUs, the host owner name, performance estimates and the number of currently running jobs on this host; and others maintained by

the work-dispatch server, which include long-term statistics, such as the job failure rate, the average turnaround time of jobs on that host, and the amount of CPU time used recently for producing error-free results. The job classad for a non-replicated job has a small set of properties, such as the job invocation parameters. However, if there are other running replicas of that job, the classad will be dynamically extended by the work-dispatch mechanism to include the host classad for each such replica. Hence, the scheduling and replication policies can refer not only to the current instance of the job, but to all the hosts executing the other replicas. The BOT classad contains the number of incomplete jobs in the BOT and, most importantly, the Tail attribute, used to monitor the execution phase of the BOT. Tail is dynamically updated by the work-dispatch logic when the transition between the high-throughput and the tail phase occurs. Note that if Tail is used in some policy, the work-dispatch logic affects its own behavior at runtime. We elaborate on tail phase detection in the implementation section. The queue classad publishes the number of BOTs in the queue, allowing the policies to refer to the current queue load. All the functional attributes expressing the policies are placed in the BOT classad and shared among all the jobs of the BOT. They include JobRequirements, ReplicationRequirements, Rank, Concurrency and Deadline, and will be discussed later. Figure 5.2 presents an example of a compound classad comprising BOT, Queue and Job classads. Observe that the Job classad also contains the classads of the hosts executing its replicas. The meaning of the policies is explained below.

5.4.2 Policy-driven work-dispatch algorithm

The work-dispatch mechanism comprises the scheduling and replication phases, described in Algorithms 5.3 and 5.4, respectively. The scheduling phase is invoked upon every job request.
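The flavor of classad evaluation can be sketched as follows (an illustrative toy, with Python expressions standing in for the classad language; the `ClassAd` class and its `eval` method are invented for this sketch). Descriptive attributes are plain values, while functional attributes such as JobRequirements are evaluated on demand against the other classads:

```python
# Toy classad-style evaluator: descriptive attributes are plain values,
# functional attributes are expressions evaluated lazily against a context.
# Python expressions stand in here for the real classad language.

class ClassAd(dict):
    def eval(self, name, **context):
        value = self[name]
        if isinstance(value, str):            # functional attribute
            return eval(value, {}, {**self, **context})
        return value                          # descriptive attribute

host = ClassAd(ErrorRate=0.08, NumCpus=4, Name="is3.myhost")
bot = ClassAd(
    Tail=True,
    # match any host when not in the tail phase; otherwise only reliable ones
    JobRequirements="(not Tail) or Host['ErrorRate'] < 0.1",
    Concurrency="2 * Host['NumCpus']",
)

print(bot.eval("JobRequirements", Host=host))  # → True
print(bot.eval("Concurrency", Host=host))      # → 8
```

In the real system the expressions are written in the classad language and evaluated by the classad interpreter, not by Python's `eval`; the sketch only conveys the schema-less, late-bound nature of the attributes.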
First, the host, queue and BOT classads are instantiated. Then the job queue is traversed: for each job, its classad is instantiated and all the policies are evaluated given the specific values of the job, BOT, host and

queue attributes.

    [ Job = [ Name = "job1"; Executable = "/bin/hostname"; NumberOfReplicas = 2;
        Replica1 = [ Name = "job1_1";
            Host = [ Name = "is3.myhost"; SentTime = 242525; ErrorRate = 0.08; ] ];
        Replica2 = [ Name = "job1_2";
            Host = [ Name = "is2.myhost"; SentTime = 242525; ErrorRate = 0.08; ] ];
      ];
      BOT = [
        JobRequirements = !Tail ? True : regexp(Host.Name, /*myhost*/) && Host.ErrorRate < 0.1;
        Rank = !Tail ? 1 : Host.JobsToday;
        ReplicationRequirements = (NumberOfReplicas < 3 && Job.Replica1.Host.ErrorRate > 0.1);
        Concurrency = 2 * Host.NumCpus;
        Deadline = Concurrency * 2000;
        JobsLeft = 10; JobsDone = 5; Tail = true;
      ];
      Queue = [ BOTsInQueue = 1; ]
    ]

Figure 5.2 Example of a typical GridBot classad

The goal of the traversal is to find a candidate set of jobs, J, for which JobRequirements evaluates to true. Among the jobs in the candidate set, those having the highest Rank are selected. The number of jobs assigned to a host at any moment is determined by the value of the Concurrency attribute. Before sending the jobs to the host, the deadline parameter for each job is assigned the value of the Deadline attribute. The ability to assign multiple jobs per host allows pipelining, or bundling, used to reduce the per-job invocation overhead for short jobs. In the multi-BOT case, the use of a higher Concurrency by a BOT with lower Rank may lead to a violation of the prioritization policy. Hence, the value of the Concurrency attribute of the highest-priority BOT is enforced. The replication phase is executed by periodically traversing all running jobs. It is regulated by two policies: the job deadline mentioned above, whose expiration signifies that

    Instantiate classads for h, BOT and queue
    Foreach job j in the job queue
        Instantiate classad for j
        Evaluate Concurrency_j, Deadline_j, JobRequirements_j and Rank_j
        If JobRequirements_j = true
            Add j to candidate set J
    End
    Order the jobs in J by Rank_j
    Concurrency ← Concurrency of the job with maximum Rank
    Foreach job j ∈ J
        Concurrency ← min(Concurrency, Concurrency_j)
        If Concurrency < assigned + #running jobs on h
            break
        End
        deadline_j ← Deadline_j
        Assign job j to host h
        assigned ← assigned + 1
    End

Figure 5.3 Scheduling phase: upon job request from host h

the remote resource failed and the job should be restarted; and ReplicationRequirements, used to speed up the computations toward the end of BOT execution. The ReplicationRequirements attribute is evaluated only when the number of idle jobs that belong to the specific BOT becomes too low. Without this constraint, the replicated and not-yet-replicated jobs would contend for the resources, leading to throughput degradation. While both the Deadline and ReplicationRequirements policies control replication, they serve two completely different goals. The replication of jobs with an expired deadline is necessary in pull-based architectures, where the client might not report its failure to the work-dispatch server. The deadline expiration ensures that any job executed by a faulty client will eventually be completed. In contrast, ReplicationRequirements aims at reducing BOT turnaround time by increasing the likelihood of successful job termination. Several examples of possible policies are presented in Figure 5.2. The matching policy defined by the JobRequirements attribute allows for execution of a job on any host if the BOT is not in the tail phase, otherwise restricting it to hosts having the string myhost in their names and a low error rate. The Rank expression assigns higher relative priority to

    Foreach running job j
        /* Replication for expired Deadline */
        Check the execution time t of j
        If t > deadline_j
            Create new replica j' and enqueue it
            Mark j as failed
            continue
        End
        /* Replication for speculative execution */
        If few unsent jobs of that BOT remain in the queue
            Find all replicas of j and their respective executing hosts
            Instantiate classad for j
            If ReplicationRequirements = true
                Create new replica j' and enqueue it
            End
        End
    End

Figure 5.4 Replication phase: once in a replication cycle

the jobs of this BOT on hosts which recently produced successful results, but this prioritization will be applied only in the tail phase. The ReplicationRequirements policy allows replication only if there are fewer than three replicas and the first one is running on a host with a high failure rate. The Concurrency expression allows a host to prefetch no more than two jobs per CPU core. The Deadline attribute assigns the job deadline parameter in accordance with the actual number of jobs sent to the host, and thus in this case indirectly depends on the host properties.

5.5 Grid overlay

The overlay of execution clients is automatically established in the grids in response to the changing resource demand. The grid overlay constructor distributes the client invocation requests between different grids under the following constraints:

1. Each grid restricts the number of concurrently executing or enqueued jobs.

2. A grid job must not stay idle on the execution host, as happens when the execution

client cannot receive new jobs from the work-dispatch server.

The second constraint is particularly difficult to satisfy when the BOT's JobRequirements policy prevents execution of jobs on hosts with specific properties, e.g., a policy which excludes hosts with a high failure rate. Clearly, this information is inaccessible to the grid submitters, as it is not maintained by the native grid resource managers. Even if it were imported from the work-dispatch server, large-scale grids typically disable fine-grained selection of individual hosts. Our solution is based on two complementary techniques. First, the running client automatically commits suicide if it fails to obtain new jobs from the server or if it detects low CPU utilization by the running job. Second, we allow coarse-grained selection of grids via the BOT's GridPolicy attribute. This attribute is evaluated by the overlay constructor in the context of the grid classads published by the grid submitters. Once the set of suitable grids is determined, the problem becomes a variation of the classic bipartite graph maximum matching problem, where multiple BOTs must be matched to multiple grids subject to the constraints on the number of available resources in each grid and the resource demand of each BOT.

5.6 Implementation

We implemented the work-dispatch algorithm and integrated it into the existing BOINC server. We begin with a brief description of the original BOINC work-dispatch logic and then explain our own implementation.

5.6.1 BOINC

BOINC uses the standard HTTP protocol for communication between the execution clients and the work-dispatch server. The server is based on the out-of-the-box Apache Web server. Data transfers are performed by the Web server, whereas the control flow is handed over to a custom backend.

The server does not maintain an open TCP connection with the clients during the remote execution of a job. Rather, clients disconnect immediately after fetching new jobs or reporting results. This design allows for impressive scalability with respect to the number of concurrently executing clients, but delays client failure detection until the deadline expiration. The server comprises several modules, in particular the scheduler and the feeder, which implement the work-dispatch logic. The scheduler handles work requests from clients. This is a latency-critical component whose performance directly affects the system throughput. Thus, in order to hide the latency of accessing the job queue in the database, the feeder pre-fetches jobs from the database and makes them available to the scheduler via a shared-memory scheduling buffer. The feeder is responsible for keeping this buffer full as long as there are jobs in the queue.

5.6.2 Integrating work-dispatch policies

Scheduling phase. Algorithm 5.3 cannot be implemented as is, because it requires the policies to be evaluated on all the jobs in the queue upon each job request. The size of the queue can easily reach a few million jobs, rendering such policy evaluation infeasible. One natural approximation of the algorithm would be to apply it to a random sample of the jobs in the queue. Yet naïve uniform sampling is not applicable, as the BOTs in the queue may each have a different number of jobs, hence the larger ones would be overrepresented in the sample. Instead, we apply the algorithm to a sample which includes jobs of all enqueued BOTs. Hence, if there are n BOTs in the queue, we reserve at least 1/n-th of the scheduling buffer capacity per BOT. To fill the relevant segment of the buffer, the feeder fetches the jobs of the BOT from the queue, redistributing the remaining buffer space among the other BOTs.
The jobs remain in the scheduling buffer until they are matched or until their scheduling time-to-live timer expires. This timer prevents congestion of the buffer by jobs with overly restrictive JobRequirements. Jobs whose time-to-live timer has expired are removed from the buffer and returned to the queue for another scheduling attempt.
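The feeder behavior described above, reserving roughly 1/n of the buffer per BOT and evicting jobs whose scheduling time-to-live expires, can be sketched as follows (illustrative only; the `Feeder` class and its quota scheme are invented here and simplified, e.g. the real feeder also redistributes unused quota among the remaining BOTs):

```python
import time
from collections import deque

# Illustrative feeder sketch: with n enqueued BOTs, each BOT is guaranteed
# at least capacity // n slots of the scheduling buffer; jobs whose
# scheduling TTL expires are evicted back to their queue for another attempt.

class Feeder:
    def __init__(self, capacity, ttl):
        self.capacity, self.ttl = capacity, ttl
        self.buffer = []                       # entries: (deadline, bot_id, job)

    def fill(self, queues):                    # queues: {bot_id: deque of jobs}
        bots = [b for b, q in queues.items() if q]
        if not bots:
            return
        quota = max(1, self.capacity // len(bots))
        now = time.time()
        for bot in bots:
            while (queues[bot]
                   and sum(b == bot for _, b, _ in self.buffer) < quota
                   and len(self.buffer) < self.capacity):
                self.buffer.append((now + self.ttl, bot, queues[bot].popleft()))

    def evict_expired(self, queues):
        now, keep = time.time(), []
        for deadline, bot, job in self.buffer:
            if deadline > now:
                keep.append((deadline, bot, job))
            else:
                queues[bot].appendleft(job)    # back to the queue
        self.buffer = keep

f = Feeder(capacity=4, ttl=60)
queues = {"A": deque(range(10)), "B": deque(range(3))}
f.fill(queues)
print(sorted(b for _, b, _ in f.buffer))  # → ['A', 'A', 'B', 'B']
```

Even though BOT A has far more queued jobs than BOT B, each ends up with the same share of the buffer, which is the point of the per-BOT reservation.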

Replication phase. Algorithm 5.4 requires continuous monitoring of the deadline expiration of all running jobs. In practice, the expired jobs are detected via an efficient database query, without exhaustive traversal. The evaluation of ReplicationRequirements, however, cannot be offloaded to the database, as it is hard, if at all possible, to map the respective classad expression to a general database query. However, the algorithm evaluates the ReplicationRequirements attribute only when there are not enough enqueued jobs of the respective BOT, hence avoiding this overhead during the high-throughput phase. Furthermore, the feeder selects the candidates for replication via weighted sampling, where the weight is inversely proportional to the number of existing replicas of a job, so that jobs having fewer replicas are replicated first. We also restrict the maximum number of replicas per job to avoid unlimited replication.

5.6.3 Tail phase detection

We consider a BOT to be in the tail phase when the number of its jobs in the queue drops below a certain threshold, usually about the size of the scheduling buffer. Once this condition is satisfied, the feeder updates the Tail attribute in the BOT's classad, making this information available to the work-dispatch logic. The advantage of this tail detection heuristic is that it does not require estimating the number of available resources, which cannot be done reliably. The new jobs created as a result of replication (or job failure) may fill the queue again, causing the Tail attribute to revert to false. Such fluctuations are sometimes undesirable and can be disabled, in particular when the Tail attribute is used to tighten the constraints on the scheduling policies in the tail phase, e.g., by allowing execution only on more reliable hosts. On the other hand, the Tail attribute can be used for automatic adjustment of the replication rate if it becomes too high.
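The replica-candidate selection and tail detection above can be sketched as follows (an illustrative sketch; `MAX_REPLICAS`, `TAIL_THRESHOLD` and the function names are invented here, and the sampling is with replacement for brevity):

```python
import random

# Illustrative sketch: candidates for speculative replication are drawn by
# weighted sampling, with weight inversely proportional to the number of
# existing replicas, so under-replicated jobs tend to be replicated first.
# A running job is assumed to have at least one replica (itself).

MAX_REPLICAS = 3               # cap to avoid unlimited replication
TAIL_THRESHOLD = 100           # roughly the scheduling-buffer size

def pick_replication_candidates(running_jobs, k, rng=random):
    eligible = [j for j in running_jobs if j["replicas"] < MAX_REPLICAS]
    if not eligible:
        return []
    weights = [1.0 / j["replicas"] for j in eligible]
    return rng.choices(eligible, weights=weights, k=min(k, len(eligible)))

def update_tail(bot):
    """Set Tail once the BOT's queued jobs drop below the threshold."""
    bot["Tail"] = bot["jobs_in_queue"] < TAIL_THRESHOLD
    return bot["Tail"]

jobs = [{"name": "j1", "replicas": 1}, {"name": "j2", "replicas": 2}]
print(update_tail({"jobs_in_queue": 42}))           # → True
print(len(pick_replication_candidates(jobs, k=1)))  # → 1
```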

5.6.4 Scalability optimizations

System scalability depends mainly on the scheduler's ability to quickly choose the set of jobs having the highest Rank for the requesting host. Since the Rank depends on the host parameters, no advanced indexing is applicable; hence only an exhaustive traversal over all the jobs in the scheduling buffer allows precise actuation of the ranking policy. The Concurrency attribute further complicates the problem, as the number of jobs to be sent to a host depends on the host properties. The option of reducing the scheduling buffer size is unacceptable, as the buffer must be large enough to represent all the enqueued BOTs. Our optimization is based on the observation that the jobs of a single BOT are almost always identical from the scheduling perspective. Indeed, the policies are specified at the BOT level, as all the jobs pertaining to the same BOT are assumed to share the same resource requirements. However, for jobs with multiple running replicas, this similarity no longer exists. The scheduling policy may differentiate between jobs having multiple running replicas by considering the properties of the hosts where these replicas are being executed. One example is a policy that disallows invocation of multiple replicas of the same job in the same grid, in order to distribute the risk. Applying the above optimization reduces the scheduling complexity from O(#jobs in the scheduling buffer) to O(#BOTs in the scheduling buffer). It also enables the rejection of unmatched hosts upfront, which is very important when community grids are part of the infrastructure, as the clients cannot be prevented from contacting the server. This optimization significantly increases the scalability while still covering most of the important scheduling policies, as will be shown in the experiments.

5.6.5 Execution clients

The overlay is formed by BOINC clients submitted to the grids by the grid submitters.
However, a few modifications to the existing clients were required, involving some nontrivial changes to allow proper error handling in grids. We focus on the following types of failures:

1. Failures due to faulty resources which continuously produce errors immediately after

starting a job (black holes);

2. Network failures, or failures due to excessive server load.

Black hole elimination requires the client statistics to be stored on the server. The client generates a unique random identifier when first contacting the server, and uses it in all future communications. This identifier is supposed to be stored persistently on the machine where the client is installed. In grids, however, the client state is wiped from the execution machine after preemption, which effectively results in the loss of the server-side statistics. We solved the problem by generating a consistent identifier from the host's MAC address. Network failures are frequent in wide-area networks, and BOINC clients automatically retry a failed transaction with exponential back-off. In grids, however, the back-off eventually leads to automatic self-termination of the client, to avoid grid resource idling. Our attempt to shorten the back-off solved this particular problem, but resulted in exceedingly high network traffic (which in fact was classified as a DDoS attack) when the real cause of the failure was server overload rather than a network outage. Hence, for large-scale deployments, the exponential back-off must be in place even at the expense of efficiency.

5.7 Results

The development of GridBot was primarily motivated by the Superlink-online system, which performs statistical analysis of genetic data for detecting defective disease-provoking genes in humans and animals. It is accessible via a simple Web interface and is used by geneticists worldwide. Since 2006, over 18,000 analyses have been performed by the system. The analysis is automatically parallelized and transformed into a BOT with jobs of the required granularity [118]. The computational demands vary significantly among different inputs, ranging from a few CPU seconds to hundreds of CPU years.
The experiments described in this section serve three main goals: to compare GridBot with the other alternatives for running BOTs; to evaluate its scalability as a function of

the number of jobs and BOTs in the queue and the job request rate; and to demonstrate the flexibility of the policy specification, as well as the impact of different scheduling and replication policies in a large-scale multi-grid environment. We performed all the experiments using real data from runs previously invoked via Superlink-online. Namely, we used the genetic data previously submitted by the users of Superlink-online, effectively re-executing the analysis, in some cases several times, to obtain statistically valid results. The GridBot deployment used for these experiments is shown in Figure 5.5. No exclusive access to the resources was granted, nor were any of them reserved during the experiments. Instead, all the grid resources were allocated on a purely opportunistic basis via the local grid management infrastructures. The current deployment features a dedicated fail-over cluster in addition to the grids. Jobs that fail repeatedly in the grids are automatically transferred to this cluster to be invoked in a controlled environment.

Figure 5.5 Deployment of GridBot for the Superlink-online system

Naive execution via BOINC overlay. We executed a medium-sized BOT using resources in all available grids. For this experiment we replaced the policy-driven work-dispatch

server with the unmodified BOINC server. The rest of the GridBot system was left unchanged. The experiment was repeated five times and the best run selected. The graph in Figure 5.6 shows the number of incomplete jobs over time. Observe the rapid job consumption in the throughput phase and the slow tail phase. The graph also demonstrates how the Deadline parameter affects job execution. The Deadline was set to three days for all jobs. This was the minimum acceptable value for volunteers in the community grid. The reason for such a long deadline lies in the structure of community grids in general, most of which assign deadlines of several weeks. Since a single client is connected to many such grids, those with shorter deadlines (less than three days) effectively require their jobs to be executed immediately, thus postponing the jobs of the other grids. This is considered selfish and leads to contributor migration and a bad project reputation, which together result in a significant decrease in throughput. Observe that some of the results were returned more than 30 hours after they were sent for execution. In general, we found that the ratio between execution time and turnaround time for jobs in the community grid varies between 0.01 and 1, with the average at 0.3 (as opposed to 1 for collaborative grids). The execution of the same BOT by GridBot using the same set of resources required only 8 hours, versus 280 hours for the naive execution, without violating the minimum deadline constraint for community grids.

GridBot versus Condor. We compared the turnaround time of a BOT executed via GridBot, under a policy routing jobs only to the UW Madison Condor pool, with the turnaround time of the same BOT executed directly via Condor in the same pool. Comparison with Condor is particularly interesting since GridBot implements a matching mechanism similar to that of the Condor work-dispatch daemon.
This setup gives an advantage to Condor, because its work-dispatch daemon is located close to the execution machines in Madison, whereas the GridBot server resides in Israel. To put GridBot under high load, we ran a BOT of 3,000 short jobs ranging from 30 seconds to 5 minutes. GridBot was configured with a 10-minute Deadline. The replication policy allowed replication of a job if the failure rate of the running host was above 10%. The BOT was executed five times in each system.

Figure 5.6 Naïve execution of a BOT in the multi-grid environment

Table 5.1 Aggregate statistics (throughput share and number of hosts) per grid for the high-throughput run in Figure 5.7

The average turnaround time in GridBot was 53±10 minutes, versus 170±41 minutes in Condor; GridBot was faster, on average, by a factor of 3. Less than 1% of the jobs were replicated. This result shows that execution via GridBot introduces no overhead compared to Condor, and in this case (small BOTs with short jobs) is even more efficient.

High-throughput run. We invoked a BOT with 2.2 million jobs ranging from 20 to 40 minutes each. The BOT was completed in 15 days. The accumulated CPU time (the sum of the times measured locally by each execution client, hence excluding communication) used by the system for this run is 115 CPU years. The effective throughput is equivalent to that of a dedicated cluster of 2,300 CPUs. The BOT execution involved all nine available clusters and grids. The contribution of the five main grids is summarized in Table 5.1. Figure 5.7(a) depicts the change in the number of incomplete jobs during the run. The almost linear form of the graph suggests that GridBot consistently managed to recruit a large number of resources despite the high volatility of grid resources.

Figure 5.7 High-throughput run statistics: (a) the number of incomplete jobs over time; (b) throughput across different grids over time

Figure 5.7(b), a snapshot of the online GridBot Web console [7], presents the effective throughput of the system during the last week of this run. In this chart, Community, OSG, EGEE, Madison-condor and T-condor signify the Superlink@Technion community grid, the Open Science Grid, the EGEE grid, the large Condor pool in UW Madison, and the Technion dedicated Condor pool, while all the rest represent different clusters and groups of desktop machines on the Technion campus. In a non-dedicated environment, the number of concurrently executing CPUs cannot be used to estimate the throughput, because of job failures. To obtain a more realistic estimate, we periodically sampled the running times of 1000 recently finished jobs, and multiplied their average by the number of jobs consumed since the last sample. Provided


CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou (  Zhejiang University Operating Systems (Fall/Winter 2018) CPU Scheduling Yajin Zhou (http://yajin.org) Zhejiang University Acknowledgement: some pages are based on the slides from Zhi Wang(fsu). Review Motivation to use threads

More information

LECTURE 3:CPU SCHEDULING

LECTURE 3:CPU SCHEDULING LECTURE 3:CPU SCHEDULING 1 Outline Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time CPU Scheduling Operating Systems Examples Algorithm Evaluation 2 Objectives

More information

Operating Systems. Process scheduling. Thomas Ropars.

Operating Systems. Process scheduling. Thomas Ropars. 1 Operating Systems Process scheduling Thomas Ropars thomas.ropars@univ-grenoble-alpes.fr 2018 References The content of these lectures is inspired by: The lecture notes of Renaud Lachaize. The lecture

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

Start of Lecture: February 10, Chapter 6: Scheduling

Start of Lecture: February 10, Chapter 6: Scheduling Start of Lecture: February 10, 2014 1 Reminders Exercise 2 due this Wednesday before class Any questions or comments? 2 Scheduling so far First-Come-First Serve FIFO scheduling in queue without preempting

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Performance of relational database management

Performance of relational database management Building a 3-D DRAM Architecture for Optimum Cost/Performance By Gene Bowles and Duke Lambert As systems increase in performance and power, magnetic disk storage speeds have lagged behind. But using solidstate

More information

CHAPTER 2: PROCESS MANAGEMENT

CHAPTER 2: PROCESS MANAGEMENT 1 CHAPTER 2: PROCESS MANAGEMENT Slides by: Ms. Shree Jaswal TOPICS TO BE COVERED Process description: Process, Process States, Process Control Block (PCB), Threads, Thread management. Process Scheduling:

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

CPU scheduling. Alternating sequence of CPU and I/O bursts. P a g e 31

CPU scheduling. Alternating sequence of CPU and I/O bursts. P a g e 31 CPU scheduling CPU scheduling is the basis of multiprogrammed operating systems. By switching the CPU among processes, the operating system can make the computer more productive. In a single-processor

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Massive Data Analysis

Massive Data Analysis Professor, Department of Electrical and Computer Engineering Tennessee Technological University February 25, 2015 Big Data This talk is based on the report [1]. The growth of big data is changing that

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems

Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems Tony Maciejewski, Kyle Tarplee, Ryan Friese, and Howard Jay Siegel Department of Electrical and Computer Engineering Colorado

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil Parallel Genome-Wide Analysis With Central And Graphic Processing Units Muhamad Fitra Kacamarga mkacamarga@binus.edu James W. Baurley baurley@binus.edu Bens Pardamean bpardamean@binus.edu Abstract The

More information

Tree-based Cluster Weighted Modeling: Towards A Massively Parallel Real- Time Digital Stradivarius

Tree-based Cluster Weighted Modeling: Towards A Massively Parallel Real- Time Digital Stradivarius Tree-based Cluster Weighted Modeling: Towards A Massively Parallel Real- Time Digital Stradivarius Edward S. Boyden III e@media.mit.edu Physics and Media Group MIT Media Lab 0 Ames St. Cambridge, MA 039

More information

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems Processes CS 475, Spring 2018 Concurrent & Distributed Systems Review: Abstractions 2 Review: Concurrency & Parallelism 4 different things: T1 T2 T3 T4 Concurrency: (1 processor) Time T1 T2 T3 T4 T1 T1

More information

vsan 6.6 Performance Improvements First Published On: Last Updated On:

vsan 6.6 Performance Improvements First Published On: Last Updated On: vsan 6.6 Performance Improvements First Published On: 07-24-2017 Last Updated On: 07-28-2017 1 Table of Contents 1. Overview 1.1.Executive Summary 1.2.Introduction 2. vsan Testing Configuration and Conditions

More information

Introduction to Grid Computing

Introduction to Grid Computing Milestone 2 Include the names of the papers You only have a page be selective about what you include Be specific; summarize the authors contributions, not just what the paper is about. You might be able

More information

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition Chapter 6: CPU Scheduling Silberschatz, Galvin and Gagne 2013 Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Thread Scheduling Multiple-Processor Scheduling Real-Time

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 10 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Chapter 6: CPU Scheduling Basic Concepts

More information

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI Overview Unparalleled Value Product Portfolio Software Platform From Desk to Data Center to Cloud Summary AI researchers depend on computing performance to gain

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers SLAC-PUB-9176 September 2001 Optimizing Parallel Access to the BaBar Database System Using CORBA Servers Jacek Becla 1, Igor Gaponenko 2 1 Stanford Linear Accelerator Center Stanford University, Stanford,

More information

CS3733: Operating Systems

CS3733: Operating Systems CS3733: Operating Systems Topics: Process (CPU) Scheduling (SGG 5.1-5.3, 6.7 and web notes) Instructor: Dr. Dakai Zhu 1 Updates and Q&A Homework-02: late submission allowed until Friday!! Submit on Blackboard

More information

Energy-efficient acceleration of task dependency trees on CPU-GPU hybrids

Energy-efficient acceleration of task dependency trees on CPU-GPU hybrids Energy-efficient acceleration of task dependency trees on CPU-GPU hybrids Mark Silberstein - Technion Naoya Maruyama Tokyo Institute of Technology Mark Silberstein, Technion 1 The case for heterogeneous

More information

CPU Scheduling: Objectives

CPU Scheduling: Objectives CPU Scheduling: Objectives CPU scheduling, the basis for multiprogrammed operating systems CPU-scheduling algorithms Evaluation criteria for selecting a CPU-scheduling algorithm for a particular system

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor

More information

Using RDMA for Lock Management

Using RDMA for Lock Management Using RDMA for Lock Management Yeounoh Chung Erfan Zamanian {yeounoh, erfanz}@cs.brown.edu Supervised by: John Meehan Stan Zdonik {john, sbz}@cs.brown.edu Abstract arxiv:1507.03274v2 [cs.dc] 20 Jul 2015

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Lecture 5 / Chapter 6 (CPU Scheduling) Basic Concepts. Scheduling Criteria Scheduling Algorithms

Lecture 5 / Chapter 6 (CPU Scheduling) Basic Concepts. Scheduling Criteria Scheduling Algorithms Operating System Lecture 5 / Chapter 6 (CPU Scheduling) Basic Concepts Scheduling Criteria Scheduling Algorithms OS Process Review Multicore Programming Multithreading Models Thread Libraries Implicit

More information

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University Frequently asked questions from the previous class survey CS 370: SYSTEM ARCHITECTURE & SOFTWARE [CPU SCHEDULING] Shrideep Pallickara Computer Science Colorado State University OpenMP compiler directives

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Properties of Processes

Properties of Processes CPU Scheduling Properties of Processes CPU I/O Burst Cycle Process execution consists of a cycle of CPU execution and I/O wait. CPU burst distribution: CPU Scheduler Selects from among the processes that

More information

Multimedia Systems 2011/2012

Multimedia Systems 2011/2012 Multimedia Systems 2011/2012 System Architecture Prof. Dr. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de Sitemap 2 Hardware

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

SNP HiTLink Manual. Yoko Fukuda 1, Hiroki Adachi 2, Eiji Nakamura 2, and Shoji Tsuji 1

SNP HiTLink Manual. Yoko Fukuda 1, Hiroki Adachi 2, Eiji Nakamura 2, and Shoji Tsuji 1 SNP HiTLink Manual Yoko Fukuda 1, Hiroki Adachi 2, Eiji Nakamura 2, and Shoji Tsuji 1 1 Department of Neurology, Graduate School of Medicine, the University of Tokyo, Tokyo, Japan 2 Dynacom Co., Ltd, Kanagawa,

More information

Oracle and Tangosol Acquisition Announcement

Oracle and Tangosol Acquisition Announcement Oracle and Tangosol Acquisition Announcement March 23, 2007 The following is intended to outline our general product direction. It is intended for information purposes only, and may

More information

CPU Scheduling. Daniel Mosse. (Most slides are from Sherif Khattab and Silberschatz, Galvin and Gagne 2013)

CPU Scheduling. Daniel Mosse. (Most slides are from Sherif Khattab and Silberschatz, Galvin and Gagne 2013) CPU Scheduling Daniel Mosse (Most slides are from Sherif Khattab and Silberschatz, Galvin and Gagne 2013) Basic Concepts Maximum CPU utilization obtained with multiprogramming CPU I/O Burst Cycle Process

More information

Chapter 5: CPU Scheduling. Operating System Concepts 8 th Edition,

Chapter 5: CPU Scheduling. Operating System Concepts 8 th Edition, Chapter 5: CPU Scheduling Operating System Concepts 8 th Edition, Hanbat National Univ. Computer Eng. Dept. Y.J.Kim 2009 Chapter 5: Process Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms

More information

Resource Allocation Strategies for Multiple Job Classes

Resource Allocation Strategies for Multiple Job Classes Resource Allocation Strategies for Multiple Job Classes by Ye Hu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer

More information

REDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS

REDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS BeBeC-2014-08 REDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS Steffen Schmidt GFaI ev Volmerstraße 3, 12489, Berlin, Germany ABSTRACT Beamforming algorithms make high demands on the

More information

L3.4. Data Management Techniques. Frederic Desprez Benjamin Isnard Johan Montagnat

L3.4. Data Management Techniques. Frederic Desprez Benjamin Isnard Johan Montagnat Grid Workflow Efficient Enactment for Data Intensive Applications L3.4 Data Management Techniques Authors : Eddy Caron Frederic Desprez Benjamin Isnard Johan Montagnat Summary : This document presents

More information

GMDR User Manual. GMDR software Beta 0.9. Updated March 2011

GMDR User Manual. GMDR software Beta 0.9. Updated March 2011 GMDR User Manual GMDR software Beta 0.9 Updated March 2011 1 As an open source project, the source code of GMDR is published and made available to the public, enabling anyone to copy, modify and redistribute

More information

ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS

ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS Prabodha Srimal Rodrigo Registration No. : 138230V Degree of Master of Science Department of Computer Science & Engineering University

More information

A Study of the Performance Tradeoffs of a Tape Archive

A Study of the Performance Tradeoffs of a Tape Archive A Study of the Performance Tradeoffs of a Tape Archive Jason Xie (jasonxie@cs.wisc.edu) Naveen Prakash (naveen@cs.wisc.edu) Vishal Kathuria (vishal@cs.wisc.edu) Computer Sciences Department University

More information

BUYING SERVER HARDWARE FOR A SCALABLE VIRTUAL INFRASTRUCTURE

BUYING SERVER HARDWARE FOR A SCALABLE VIRTUAL INFRASTRUCTURE E-Guide BUYING SERVER HARDWARE FOR A SCALABLE VIRTUAL INFRASTRUCTURE SearchServer Virtualization P art 1 of this series explores how trends in buying server hardware have been influenced by the scale-up

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

Dependency detection with Bayesian Networks

Dependency detection with Bayesian Networks Dependency detection with Bayesian Networks M V Vikhreva Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Leninskie Gory, Moscow, 119991 Supervisor: A G Dyakonov

More information

High-Performance and Parallel Computing

High-Performance and Parallel Computing 9 High-Performance and Parallel Computing 9.1 Code optimization To use resources efficiently, the time saved through optimizing code has to be weighed against the human resources required to implement

More information

Modeling and Simulating Discrete Event Systems in Metropolis

Modeling and Simulating Discrete Event Systems in Metropolis Modeling and Simulating Discrete Event Systems in Metropolis Guang Yang EECS 290N Report December 15, 2004 University of California at Berkeley Berkeley, CA, 94720, USA guyang@eecs.berkeley.edu Abstract

More information

Last Class: Processes

Last Class: Processes Last Class: Processes A process is the unit of execution. Processes are represented as Process Control Blocks in the OS PCBs contain process state, scheduling and memory management information, etc A process

More information

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Analytical Modeling of Parallel Systems To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 143 CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 6.1 INTRODUCTION This chapter mainly focuses on how to handle the inherent unreliability

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Analytical Modeling of Parallel Programs

Analytical Modeling of Parallel Programs 2014 IJEDR Volume 2, Issue 1 ISSN: 2321-9939 Analytical Modeling of Parallel Programs Hardik K. Molia Master of Computer Engineering, Department of Computer Engineering Atmiya Institute of Technology &

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Multiprocessor Scheduling. Multiprocessor Scheduling

Multiprocessor Scheduling. Multiprocessor Scheduling Multiprocessor Scheduling Will consider only shared memory multiprocessor or multi-core CPU Salient features: One or more caches: cache affinity is important Semaphores/locks typically implemented as spin-locks:

More information

Multiprocessor Scheduling

Multiprocessor Scheduling Multiprocessor Scheduling Will consider only shared memory multiprocessor or multi-core CPU Salient features: One or more caches: cache affinity is important Semaphores/locks typically implemented as spin-locks:

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing This document consists of two parts. The first part introduces basic concepts and issues that apply generally in discussions of parallel computing. The second part consists

More information

HPC methods for hidden Markov models (HMMs) in population genetics

HPC methods for hidden Markov models (HMMs) in population genetics HPC methods for hidden Markov models (HMMs) in population genetics Peter Kecskemethy supervised by: Chris Holmes Department of Statistics and, University of Oxford February 20, 2013 Outline Background

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

A MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS. A Thesis. presented to. the Faculty of California Polytechnic State University

A MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS. A Thesis. presented to. the Faculty of California Polytechnic State University A MULTI-GPU COMPUTE SOLUTION FOR OPTIMIZED GENOMIC SELECTION ANALYSIS A Thesis presented to the Faculty of California Polytechnic State University San Luis Obispo In Partial Fulfillment of the Requirements

More information

Towards ParadisEO-MO-GPU: a Framework for GPU-based Local Search Metaheuristics

Towards ParadisEO-MO-GPU: a Framework for GPU-based Local Search Metaheuristics Towards ParadisEO-MO-GPU: a Framework for GPU-based Local Search Metaheuristics N. Melab, T-V. Luong, K. Boufaras and E-G. Talbi Dolphin Project INRIA Lille Nord Europe - LIFL/CNRS UMR 8022 - Université

More information

Subject Name:Operating system. Subject Code:10EC35. Prepared By:Remya Ramesan and Kala H.S. Department:ECE. Date:

Subject Name:Operating system. Subject Code:10EC35. Prepared By:Remya Ramesan and Kala H.S. Department:ECE. Date: Subject Name:Operating system Subject Code:10EC35 Prepared By:Remya Ramesan and Kala H.S. Department:ECE Date:24-02-2015 UNIT 1 INTRODUCTION AND OVERVIEW OF OPERATING SYSTEM Operating system, Goals of

More information

Learning complex object-class models in natural conditions. Dan Levi

Learning complex object-class models in natural conditions. Dan Levi Learning complex object-class models in natural conditions by Dan Levi A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science and Applied

More information

Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster

Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster. Overview Both the industry and academia have an increase demand for good policies and mechanisms to

More information

The Automatic Design of Batch Processing Systems

The Automatic Design of Batch Processing Systems The Automatic Design of Batch Processing Systems by Barry Dwyer, M.A., D.A.E., Grad.Dip. A thesis submitted for the degree of Doctor of Philosophy in the Department of Computer Science University of Adelaide

More information

A Survey on Grid Scheduling Systems

A Survey on Grid Scheduling Systems Technical Report Report #: SJTU_CS_TR_200309001 A Survey on Grid Scheduling Systems Yanmin Zhu and Lionel M Ni Cite this paper: Yanmin Zhu, Lionel M. Ni, A Survey on Grid Scheduling Systems, Technical

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

ASYNCHRONOUS MATRIX FRAMEWORK WITH PRIORITY-BASED PROCESSING. A Thesis. Presented to. the Faculty of

ASYNCHRONOUS MATRIX FRAMEWORK WITH PRIORITY-BASED PROCESSING. A Thesis. Presented to. the Faculty of ASYNCHRONOUS MATRIX FRAMEWORK WITH PRIORITY-BASED PROCESSING A Thesis Presented to the Faculty of California Polytechnic State University, San Luis Obispo In Partial Fulfillment of the Requirements for

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM

A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM Akshay S. Agrawal 1, Prof. Sachin Bojewar 2 1 P.G. Scholar, Department of Computer Engg., ARMIET, Sapgaon, (India) 2 Associate Professor, VIT,

More information

MONTE CARLO SIMULATION FOR RADIOTHERAPY IN A DISTRIBUTED COMPUTING ENVIRONMENT

MONTE CARLO SIMULATION FOR RADIOTHERAPY IN A DISTRIBUTED COMPUTING ENVIRONMENT The Monte Carlo Method: Versatility Unbounded in a Dynamic Computing World Chattanooga, Tennessee, April 17-21, 2005, on CD-ROM, American Nuclear Society, LaGrange Park, IL (2005) MONTE CARLO SIMULATION

More information

Processes, PCB, Context Switch
