Creating Personal Adaptive Clusters for Managing Scientific Jobs in a Distributed Computing Environment

Edward Walker, Jeffrey P. Gardner, Vladimir Litvin, and Evan L. Turner

Edward Walker is with the Texas Advanced Computing Center, The University of Texas at Austin, Austin, TX, USA (e-mail: ewalker@tacc.utexas.edu). Jeffrey P. Gardner is with the Pittsburgh Supercomputing Center, Pittsburgh, PA, USA (e-mail: gardnerj@psc.edu). Vladimir Litvin is with the High Energy Physics Group, California Institute of Technology, Pasadena, CA, USA (e-mail: litvin@hep.caltech.edu). Evan L. Turner is with the Texas Advanced Computing Center, The University of Texas at Austin, Austin, TX, USA (e-mail: eturner@tacc.utexas.edu).

Abstract: We describe a system for creating personal clusters in user space to support the submission and management of thousands of compute-intensive serial jobs to the network-connected compute resources on the NSF TeraGrid. The system implements a robust infrastructure that submits and manages job proxies across a distributed computing environment. These job proxies contribute resources to personal clusters created dynamically for a user on demand. The system adapts to the prevailing job load conditions at the distributed sites by migrating job proxies to sites expected to provide resources more quickly. The version of the system described in this paper allows users to build large personal Condor and Sun Grid Engine clusters on the TeraGrid. Users can then submit, monitor, and control their scientific jobs through a single uniform interface, using the feature-rich functionality found in these job management environments. Up to 1, user jobs have been submitted through the system to date, enabling approximately 9 teraflops of scientific computation.

Index Terms: Resource management, distributed computing, cooperative systems.

I. INTRODUCTION

The TeraGrid is a multi-year, multi-million-dollar National Science Foundation (NSF) project to deploy one of the world's largest distributed infrastructures for open scientific research [1]. The project links eight large resource provider sites with a high-speed 1-3 GBits/second dedicated network, providing, in aggregate, over 4 teraflops of computing power, 2 petabytes of storage capacity, and high-end facilities for visualization and data analysis of computation results. In particular, the compute resources on the TeraGrid are composed of a heterogeneous mix of compute clusters running different operating systems on different instruction set architectures. Furthermore, most sites deploy different workload management systems for users to interact with the compute cluster. Even when the same workload management system is used, different queue names and job submission limits result in subtly different job submission and management semantics across the different sites on the TeraGrid. Some run-time uniformity exists across sites in the availability of a guaranteed software stack, called the Common TeraGrid Software Stack (CTSS), and a common set of environment variables for locating well-known file directory locations. However, the TeraGrid is essentially a large distributed system of computing resources with different capabilities and configurations. The TeraGrid can therefore benefit from middleware tools that provide a common seamless environment for users to submit and manage jobs across sites, and that allow its aggregate capabilities to be harnessed transparently.
These tools also have to preserve resource ownership across sites, allowing sites to continue leveraging local and global partnerships outside of the project. For example, many sites support local university users, national users through NSF programs, and possibly industrial users through partnership programs. Sites therefore need the autonomy to configure their systems to meet the different needs of these different communities.

The middleware tool described in this paper supports ensemble job submissions across the distributed computational resources on the TeraGrid. A recent TeraGrid user survey [2] indicated that over half the respondents had the requirement to submit, manage, and monitor many hundreds, or even thousands, of jobs at once. Two large scientific projects in particular provided the initial motivation for our research: the Compact Muon Solenoid (CMS) particle physics project and the National Virtual Observatory (NVO) astronomy project. Scientists with the Caltech high-energy physics group will be seeking the Higgs boson using the CMS detector in the Large Hadron Collider (LHC) experiment at CERN [3]. The researchers expect to submit more than a million simultaneous serial jobs on the TeraGrid to enable their research. The NVO project [4] also has a number of applications (such as star formation measurements and quasar spectra model fitting) with job submission requirements of between 5, and 5, simultaneous jobs on the TeraGrid.

Both projects have large allocations on the TeraGrid. In this paper, we describe the production infrastructure that was developed to support the execution of these two projects on the TeraGrid. The system we describe builds personal clusters for users to submit, manage and monitor computational jobs across sites on the TeraGrid. These personal clusters can be managed by Condor [14],[15] or Sun Grid Engine (SGE) [19], providing a single uniform job management interface for users of the TeraGrid. Furthermore, the system allows TeraGrid resources to be added to existing departmental Condor or SGE clusters, allowing the expansion of local resources on-demand during busy computation periods.

The rest of the paper is organized as follows. Section II defines our problem statement and examines related work supporting large ensemble job submissions in distributed computing environments. Section III describes our proposed solution, introducing the concept of a virtual login session and the use of job proxies to acquire resources for creating personal clusters on-demand. Section IV measures the overhead of executing user jobs through the job proxies created by the system, and Section V examines the efficiency of three different job proxy allocation strategies and some of their trade-offs. Finally, Section VI concludes the paper with a discussion of future work.

II. RELATED WORK

In this paper we examine the question of how to efficiently support multiple users in the submission and management of many thousands of simultaneous serial jobs across a heterogeneous mix of large compute clusters connected in a distributed environment. One approach to submitting ensemble jobs across a distributed infrastructure like the TeraGrid is to do so directly through a gateway node at each participating site. This approach uses a tool such as Globus [5][6] or UNICORE [8] to provide a common job submission interface to the different job managers at each site. For many applications, this is a very good strategy. However, despite the scalable nature of these software solutions, our requirement of submitting thousands of jobs by multiple users simultaneously can cause the single gateway resource at each site to become overloaded. Furthermore, future application requirements to submit many small jobs that execute for only a few minutes will further exacerbate this problem.

A commonly suggested alternative to submitting jobs directly to a gateway node is to submit jobs to an intelligent central agent that schedules jobs to a site based on heuristics that minimize the cost (time and resource consumption) of the entire ensemble execution. These central agents develop job schedules that help throttle the dispatch of jobs to the gateway nodes, alleviating the problem as a side-effect. Tools that do this include Pegasus/Condor-G [21], Nimrod/G [10], and APST [11]. All three systems have been successfully used to schedule job ensembles across a Grid, and the scheduling technologies they developed have been shown to enable job ensembles to complete more efficiently than simply flooding each gateway node with as many jobs as possible. However, this approach of maintaining jobs at a central agent and dispatching them through a remote execution adaptor has a number of issues in the context of our specific problem. Firstly, a central agent has to deal with remote outages at a gateway node where the only service exposed may be a job execution service like Globus GRAM [7].
This is difficult because during a gateway outage, a central meta-scheduler can only guess at the current state of its submitted jobs at a site. Secondly, and more importantly, this approach can result in non-optimal use of the cluster resources on the TeraGrid due to the serial nature of our job submission requirements. TeraGrid clusters often have a limit on the number of jobs a user can submit to a site's local job queue, and some local site schedulers even explicitly favor large parallel jobs over serial jobs [12]. Due to these local scheduler policies, some users have resorted to repackaging their serial jobs into a parallel job where multiple CPUs are requested in each submission. However, repackaging serial jobs into a parallel job submission introduces other problems for the user: the user now loses the ability to monitor and control individual jobs in an experiment. This ability to monitor and control individual jobs is an important requirement, especially in the case of job failures. Users often need to be notified of a single job failure, requiring the ability to diagnose the problem by examining the error output of the job, and the ability to resubmit individual jobs on problem resolution.

III. CREATING ADAPTIVE CLUSTERS ON-DEMAND

The system we propose automatically submits and manages parallel job proxies across cluster sites, creating personal clusters from resources contributed by these proxies. These personal clusters can be created on a per-user, per-experiment basis, or they can be used to contribute to an existing departmental cluster to extend local computation capabilities. The approach we propose also provides a single job management interface over the heterogeneous distributed environment and leverages existing feature-rich cluster workload management systems. Figure 1 conceptually shows what our proposed solution does.

Figure 1 - Distributed proxy agents pull resources into personal clusters created on-demand.

Our system builds Condor or SGE clusters when a user creates a virtual login session. Within this virtual login session, users can submit, monitor and manage jobs through a single job management interface, emulating the experience of a traditional cluster login session.

The job proxies (transparently submitted and managed by a cluster builder agent infrastructure) start commodity cluster job-starter daemons (currently from Condor and SGE), which call back to either a pre-existing departmental cluster or to a dynamically created cluster at the user's workstation. The GridShell framework [16][17] is leveraged to provide a transparent environment in which to execute the cluster builder agents developed for our system. Our system delegates the responsibility of submitting the job proxies to a single semi-autonomous agent spawned, through Globus, at each TeraGrid site during a virtual login session. This agent translates the job proxy submission to the local batch submission syntax, maintains some number of job proxies throughout the lifetime of a login session, and may negotiate migration of job proxies between sites based on prevailing job load conditions across sites. Our approach has the following advantages:

Scalability: user jobs are routed directly to the compute nodes at each site and not through the gateway node. Only a single agent is started at a gateway node, and it submits and maintains the job proxies in the local queue.

Fault-tolerance: each semi-autonomous agent at a gateway node maintains the job proxy submission locally, allowing transient network outages to be tolerated and gateway node reboots to be handled in isolation from the rest of the system.

Technology inheritance: the entire commodity Condor and SGE infrastructure, with all their associated add-on tools, is leveraged to provide a single job management environment for the user.

Future versions of the system will allow other cluster workload management systems to be the user-selectable interface to the TeraGrid, e.g. the Portable Batch System (PBS) [18].

A. The Virtual Login Session

Figure 2 and Figure 3 show Condor and SGE virtual login sessions. A number of features are highlighted in these figures. First, a vo-login command is invoked by the user. The command reads a configuration file, located in the user's home directory, listing the participating sites in each session. Second, the command prompts the user for a GSI [20] password. This creates a valid GSI credential used to spawn an agent at the participating sites. The command also allows the user to use an existing credential. Third, a command-line prompt is returned to the user, who is then able to issue Condor or SGE commands. Fourth, the user can query the status of all remote job proxies with the agent_jobs command. Fifth, the user can detach from the login shell and reattach to it at some future time from a different terminal. And finally, sixth, the system automatically cleans up the environment and shuts down all remote agents when the shell is exited.

Figure 2 - Formatted screenshot of a virtual login session creating a personal Condor pool.

Figure 4 shows the process architecture of our system. When a user invokes vo-login on a client workstation, an Agent Manager process is started on the local machine. The agent manager remotely executes a Proxy Manager process at each participating site using Globus GRAM. The agent manager at the client workstation also forks a Master Node Manager process within a GNU Screen session, which then starts the Condor/SGE master daemons in user space if needed. The proxy managers send periodic heart-beat messages back to the central agent manager to allow each to monitor the health of the other.
Based on missed heart-beats, the agent manager may reallocate job proxies to other proxy managers, or a proxy manager may voluntarily shut down if it has not heard from the agent manager for a period of time.

Figure 3 - Formatted screenshot of a virtual login session creating a personal SGE cluster.
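The heart-beat bookkeeping on both ends can be pictured with a short sketch (Python, illustrative only; the interval, the missed-beat threshold, and all names are our assumptions rather than values taken from the system):

```python
import time

HEARTBEAT_INTERVAL = 60      # seconds between proxy-manager heart-beats (assumed)
AGENT_SILENCE_LIMIT = 1800   # proxy manager gives up after this much silence (assumed)

class AgentManagerLiveness:
    """Tracks the last heart-beat received from each remote proxy manager."""

    def __init__(self):
        self.last_seen = {}                   # site name -> timestamp of last heart-beat

    def on_heartbeat(self, site: str) -> None:
        self.last_seen[site] = time.time()

    def suspect_sites(self, missed_beats: int = 3) -> list:
        """Sites whose job proxies become candidates for reallocation."""
        cutoff = time.time() - missed_beats * HEARTBEAT_INTERVAL
        return [site for site, seen in self.last_seen.items() if seen < cutoff]

def proxy_manager_should_shut_down(last_agent_contact: float) -> bool:
    """A proxy manager that has not heard from the agent manager for a long
    period voluntarily shuts itself down."""
    return time.time() - last_agent_contact > AGENT_SILENCE_LIMIT
```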

The proxy manager at each site in a virtual login session is responsible for submitting job proxies through a local GridShell Submission Agent. The proxy manager invokes this submission agent by wrapping the Condor or SGE job-starter daemon executable in a GridShell script and executing it. The submission agent submits and maintains the job proxy in the local job queue using the GridShell infrastructure [16].

Figure 4 - Component architecture of the system.

When the job proxy is finally run by the local site scheduler, a Task Manager process is started. The task manager is responsible for reading the slave node configuration and starting Slave Node Manager processes in parallel on the nodes allocated by the local site scheduler. The task manager then calls back to the proxy manager, allowing the proxy manager to begin monitoring the state of the running job proxy. The slave node managers are responsible for starting the Condor or SGE job-starter daemons. These daemons connect back to the master processes on the client workstation, or to a pre-existing departmental cluster, depending on the virtual login configuration. Job proxies now appear as available compute resources in the user's expanding personal Condor or SGE cluster.

B. Authentication Framework

The TCP communication between the Proxy and Agent Managers is not encrypted. However, all TCP connections used in the system are authenticated. The authentication framework used by the system is shown in Figure 5. When a user first invokes vo-login, a secret 64-bit key, auth_key, is randomly generated and given to the Agent Manager process as an input parameter on startup. The vo-login command then authenticates with each participating site using the GSI authentication mechanism. Once authenticated, the system spawns a GridShell script at the remote site, within which the Proxy Manager is started with the same auth_key as an input parameter. Subsequently, all communication between the Proxy Manager and the Agent Manager is preceded by an auth_key-encrypted challenge string. The challenge string consists of a fixed, previously agreed-upon section and a variable section. The system ensures that the variable section is unique for every new connection, allowing the Agent Manager to discard connections with the same encrypted challenge string. This is done to prevent replay attacks from intruders who have successfully sniffed a previous connection.

C. Job Proxy States and Recovery

Figure 6 shows the complete job proxy state transitions during a normal fault-free run and during recovery. For a fault-free run, the normal job proxy state transition is PEND -> RUN -> EXIT. A job proxy is in the PEND state if it is in the local queue and is waiting to be run. The job proxy is in the RUN state when it is started by the local site scheduler, and in the EXIT state when it terminates and is removed from the local queue. In a fault-free run, a job proxy in the RUN state goes to the EXIT state for either of two reasons:
1. The proxy is killed by the local batch system when its local wall-clock limit is reached.
2. The proxy exits voluntarily if it remains idle for some configurable time period. This can happen when there are insufficient user jobs that can be scheduled by the system.
The job proxy may also transition between the PEND and RELEASE states. A job proxy is in the RELEASE state when it is being considered for migration to another site. This is explained further in Section V. When the gateway node reboots, the proxy manager is expected to accurately recover the state of its submitted proxies.
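The fault-free lifecycle just described can be summarized as a small state machine before recovery is examined in detail (a sketch; the state names follow the text, but the transition table is our reading of it):

```python
from enum import Enum

class ProxyState(Enum):
    PEND = "PEND"        # waiting in the local batch queue
    RUN = "RUN"          # started by the local site scheduler
    RELEASE = "RELEASE"  # PEND proxy being considered for migration to another site
    EXIT = "EXIT"        # terminated and removed from the local queue

# Fault-free transitions described above: PEND -> RUN -> EXIT, plus the
# PEND <-> RELEASE migration hand-off. Recovery-time transitions are handled
# separately by the proxy manager, as described next.
ALLOWED = {
    ProxyState.PEND:    {ProxyState.RUN, ProxyState.RELEASE},
    ProxyState.RELEASE: {ProxyState.PEND},
    ProxyState.RUN:     {ProxyState.EXIT},   # wall-clock limit reached or idle timeout
    ProxyState.EXIT:    set(),
}

def transition(current: ProxyState, target: ProxyState) -> ProxyState:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal proxy transition {current.value} -> {target.value}")
    return target
```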
The proxy manager is restarted by a crontab entry that the system creates when the proxy manager is first started up.

Figure 5 - Authenticating connections between the Agent and Proxy Managers with an encrypted challenge string: (1) generate auth_key; (2) start the Agent Manager; (3) GSI authentication; (4) start the Proxy Manager via a GRAM job; (5) auth_key-encrypted challenge, where challenge string := <fixed>.<variable>.
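A minimal sketch of the challenge exchange from Figure 5 is shown below. The paper only states that the challenge string is encrypted with the shared auth_key; the HMAC used here is an illustrative stand-in for that unspecified encryption, and all names and the fixed section are our assumptions:

```python
import hashlib
import hmac
import secrets

FIXED_PART = b"vo-login"                       # previously agreed-upon fixed section (assumed value)

def make_challenge(auth_key: bytes) -> tuple:
    """Proxy-manager side: build a <fixed>.<variable> challenge and its keyed digest."""
    variable = secrets.token_hex(8).encode()   # unique for every new connection
    challenge = FIXED_PART + b"." + variable
    tag = hmac.new(auth_key, challenge, hashlib.sha256).digest()
    return challenge, tag

class ChallengeVerifier:
    """Agent-manager side: accept a connection only if the challenge verifies
    under the shared auth_key and has never been seen before (anti-replay)."""

    def __init__(self, auth_key: bytes):
        self.auth_key = auth_key
        self.seen = set()

    def accept(self, challenge: bytes, tag: bytes) -> bool:
        expected = hmac.new(self.auth_key, challenge, hashlib.sha256).digest()
        ok = (challenge.startswith(FIXED_PART + b".")
              and hmac.compare_digest(tag, expected)
              and challenge not in self.seen)
        if ok:
            self.seen.add(challenge)           # later replays of this challenge are discarded
        return ok
```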

Figure 6 - Job proxy state transition diagram.

When the proxy manager is restarted on recovery, it first checks a job-info file containing the last known job proxy states. If a proxy is logged in the RUN state, the proxy manager attempts to connect to the host (and port) of its task manager, which is also logged in the job-info file. If it connects successfully, the proxy is transitioned back to the RUN state; otherwise the proxy is transitioned to the EXIT state. If a job proxy is logged in the PEND state before recovery, the proxy manager has to consider four possible proxy state transitions that may have occurred while it was down:

State transition / Recovery event:
1. PEND -> RUN -> EXIT : no job-tag found
2. PEND -> EXIT : job-tag found ^ proxy not in queue
3. PEND (no change) : job-tag found ^ proxy in queue
4. PEND -> RUN : job-tag found ^ connected to task manager

All proxies submitted by the proxy manager have a unique job-tag logged in a cache directory. This job-tag is removed from the cache directory by the task manager when the proxy exits. Using this job-tag, the state transitions for cases 1 to 3 (shown in the table) can be distinguished by the absence or presence of a proxy's job-tag when the proxy manager recovers. If a job-tag is found, the proxy manager transitions the proxy state to FOUND_TAG. The proxy manager then checks the local queue for the job proxy. This accommodates the edge case where the job proxy was manually removed from the local queue while the proxy manager was down. If the job proxy is still in the queue, the job proxy state is transitioned to PEND; otherwise its state is transitioned to EXIT. Finally, for case 4, a job proxy in the PEND state transitions to the RUN state when the task manager for the job proxy reconnects with the proxy manager. This will eventually occur after recovery because the task manager will periodically attempt to reconnect with its proxy manager until successful.

D. Agent Manager Failure and Recovery Mode

The agent manager keeps minimal state. Information about remote proxy managers is not persisted because we assume they will eventually reconnect with the agent manager when the client recovers. In our system, the agent manager only persists its listening port for re-binding on recovery.

E. Interposing Domain Name Service System Calls

The gateway and compute nodes on all the clusters on the TeraGrid are multi-homed, i.e. there are multiple network interface cards (NICs), with multiple hostnames and IP addresses associated with each node. Therefore, in order to ensure that the Condor and SGE job-starter daemons use the hostname associated with the correct NIC, i.e. the IP address with public network connectivity, we interpose the domain name service (DNS) uname() and gethostname() system calls with a version that uses the name specified in the environment variable _GRID_MYHOSTNAME, if it is defined. The _GRID_MYHOSTNAME environment variable is set by our system to the resolved hostname of the correct NIC for the Condor and SGE daemons to use prior to their startup. Our system discovers the correct NIC by allowing the slave node managers to connect to their known remote agent manager on startup. When a connection is successful, the slave node manager checks the locally bound IP address of the successful outbound TCP connection. Once this IP address is discovered, the environment variable _GRID_MYHOSTNAME is set to its resolved hostname. This procedure causes the Condor and SGE daemons to always use the hostname associated with the correct IP address. We have found this to be a very successful strategy for dealing with the complexities of starting Condor and SGE daemons in multi-homed environments, without the need to modify technology-specific configuration files.
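The NIC discovery step amounts to opening an outbound connection and asking the operating system which local address it bound; a minimal sketch, assuming hypothetical agent manager coordinates and helper names:

```python
import os
import socket

def discover_public_hostname(agent_host: str, agent_port: int) -> str:
    """Connect out to the already-known agent manager and read back the local
    address the kernel selected; on a multi-homed node that address belongs to
    the NIC with the required public connectivity."""
    with socket.create_connection((agent_host, agent_port), timeout=10) as conn:
        local_ip = conn.getsockname()[0]
    return socket.getfqdn(local_ip)            # resolved hostname of the correct NIC

# Exported before the Condor/SGE daemons start, so the interposed
# uname()/gethostname() calls hand them this name instead of the default one.
# Host and port here are illustrative placeholders.
os.environ["_GRID_MYHOSTNAME"] = discover_public_hostname("agent.example.org", 4510)
```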
IV. OVERHEAD OF EXECUTING JOBS THROUGH PROXIES

In this section we examine the overhead of executing jobs through the job proxies created by our system. For our experiment we create one-CPU Condor and SGE clusters at NCSA, using a single long-running job proxy. We then submit 3 jobs, each with a run time of 6 seconds, to these clusters from TACC. The total turnaround time, from the time of first job execution to the time of last job completion, is then measured.

For the Condor case, we examined seven different job submission scenarios. In scenario one, we manually stage the required executable (the sleep command) and tell Condor not to stage the executable before job execution. In scenario two, we allow Condor to stage the executable prior to job startup. The executable size was 1 KB. In scenarios three to seven, we use the Condor file transfer mechanism to stage data files of sizes 1 MB, 2 MB, 3 MB, 4 MB and 5 MB respectively, prior to job startup.

For the SGE case, we only measured the turnaround time for scenario one. This is because SGE does not have a built-in mechanism for staging binaries and input files. We therefore submitted a shell script, which SGE stages at the remote site, within which the pre-staged binary is invoked. For this scenario, we also submitted the 3 jobs using the SGE job-array submission feature.
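The overhead reported in Table 1 below is simply the measured turnaround relative to the ideal serial turnaround through the single one-CPU proxy; a minimal sketch of that arithmetic (job count and run time are taken from the text as extracted above):

```python
def proxy_overhead(measured_turnaround: float, n_jobs: int, job_runtime: float) -> float:
    """Fractional overhead of running the jobs through the job proxy.

    With a single one-CPU proxy the jobs run back to back, so the ideal
    turnaround is simply n_jobs * job_runtime."""
    ideal = n_jobs * job_runtime
    return (measured_turnaround - ideal) / ideal

# Example: a measured turnaround 5% above the ideal reports as roughly 0.05.
print(proxy_overhead(measured_turnaround=1.05 * 3 * 6, n_jobs=3, job_runtime=6))
```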

The turnaround timings for all scenarios and cases are shown in Figure 7.

Figure 7 - Total turnaround times for executing 3 jobs, with run times of 6 seconds each, submitted from TACC and executed through a single job proxy at NCSA.

The overhead introduced by our job proxy for each scenario is shown in Table 1, where we assume an ideal turnaround time of 3 * 6 = 18 seconds.

Table 1 - Job execution time overhead when executing jobs through job proxies.
Scenario           Condor    SGE
No staging         5%        25%
Stage executable   9%        -
1 MB transfer      9%        -
2 MB transfer      17.5%     -
3 MB transfer      28.1%     -
4 MB transfer      29%       -
5 MB transfer      29.8%     -

The Condor job proxy introduces a much smaller overhead than the SGE job proxy. However, in both cases, we feel the overhead of executing user jobs through our job proxies is acceptable. We do, however, recommend the Condor job proxy over the SGE job proxy based on our experimental findings.

V. ADAPTIVE JOB PROXY ALLOCATION STRATEGIES

Our system executes user jobs on CPU resources obtained from job proxies submitted to sites on the TeraGrid. Figure 8 illustrates a scenario where five user jobs are mapped to two job proxies. The two job proxies have been submitted to a single site, and each proxy has a CPU count of two. Depending on the prevailing load conditions, all jobs may be able to run in their own individual CPU acquired by the job proxies, or multiple jobs may have to run consecutively on fewer CPUs. If there are few user jobs, some proxies may never be used, and are therefore wasted if run. Also, resources may be wasted if only a subset of a proxy's CPU allocation is used, since the entire job proxy remains running until no more jobs are assigned to it. It is therefore important to carefully consider job proxy allocation strategies that maximize the user's job throughput and minimize wasted CPU resources.

Figure 8 - Mapping five user jobs to two job proxies of CPU count two each. Both proxies run on one site.

Two pertinent questions need to be considered by our choice of job proxy allocation strategy: how large should job proxies be, and how many job proxies should the system submit? Smaller job proxies are advantageous because the local site scheduler may find CPU resources for them more quickly. Smaller job proxies may also result in fewer wasted CPUs close to the end of a job ensemble run, as fewer jobs remain in the system. However, the disadvantage of smaller job proxies is that the system needs to submit and manage more job proxies across the sites. Submitting many more job proxies across sites than there are user jobs has the advantage of increasing the system's chance of finding resources quickly for the user. However, the disadvantage is that this may result in wasted resources. A job proxy is a wasted resource if it is never used during a virtual login session. These wasted job proxies consume resources when they are added to and deleted from the local queue, and when the scheduler has to consider them for scheduling amongst the other jobs in the queue. In our system, a wasted job proxy manifests as a job proxy in the PEND state, because job proxies voluntarily terminate when not used and are re-submitted by the system in the PEND state for future consideration.

In this section we only consider a subset of the issues discussed above. We examine strategies that minimize the submission of wasted job proxies while maximizing user job throughput. We examine three allocation strategies: (A) over-allocating, (B) under-allocating and expanding the allocation over time, and (C) sharing a fixed allocation between sites.

A. Experimental Setup

We evaluate the proposed job proxy allocation strategies by dynamically creating personal clusters using the clusters at the Texas Advanced Computing Center (TACC), the National Center for Supercomputing Applications (NCSA), the San Diego Supercomputing Center (SDSC), and the Center for Advanced Computing Research (CACR). We then simulate a scientist submitting 1 jobs to the personal cluster, with each job run-time uniformly distributed between 3 and 6 minutes. We examine the CPU profile of the dynamically generated personal clusters over time as these 1 user jobs are processed through them. The turnaround time of the entire experiment is also measured to quantify the impact of the different strategies on the user. Twelve experiments were conducted by alternating between the proposed strategies over the course of three consecutive days. The remainder of the section examines some representative results from the best-performing runs using each of the proposed strategies.
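To make the job-to-proxy mapping concrete, the following toy simulation packs an ensemble of jobs onto the CPU slots contributed by the job proxies and reports the resulting turnaround; every parameter here is an illustrative placeholder, not a value from the experiment:

```python
import random

def simulate_turnaround(n_jobs: int, proxy_cpu_slots: int,
                        run_min: float, run_max: float, seed: int = 0) -> float:
    """Greedy mapping of user jobs onto proxy CPU slots (in the spirit of
    Figure 8): each job goes to the slot that frees up earliest, and the
    turnaround is the finish time of the last job."""
    rng = random.Random(seed)
    runtimes = [rng.uniform(run_min, run_max) for _ in range(n_jobs)]
    slot_free_at = [0.0] * proxy_cpu_slots
    for runtime in runtimes:
        earliest = min(range(proxy_cpu_slots), key=slot_free_at.__getitem__)
        slot_free_at[earliest] += runtime
    return max(slot_free_at)

# Example: 1000 jobs of 30-60 minutes on a personal cluster of 500 proxy CPUs.
print(simulate_turnaround(1000, 500, run_min=30 * 60, run_max=60 * 60) / 3600, "hours")
```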

B. Strategy A: Over allocating job proxies

In this strategy the total number of CPUs requested by the job proxies is approximately twice the number of user jobs submitted. The job proxy allocation is divided equally across the proxy managers and is fixed for the duration of the experiment. In the experiment we configured the proxy managers to submit 15 jobs, each requesting 32 CPUs, at every site. This equates to requesting 192 CPUs in total across all the sites for the duration of the experiment. The CPU profile of the personal cluster generated over the course of the experiment is shown in Figure 9. The profile shows an average personal cluster size of 933 processors.

Figure 9 - CPUs provisioned to the personal cluster over time with strategy A.

The number of job proxies in the PEND state during the lifetime of the experiment is shown in Figure 10. On average, 29 job proxies are always in the PEND state throughout the lifetime of the experiment.

Figure 10 - Number of PEND job proxies over time with strategy A.

C. Strategy B: Under allocating and expanding allocation over time

In this strategy, the total number of CPUs requested by the job proxies is initially fewer than the number of user jobs submitted. However, as time progresses, the proxy managers independently increase their proxy allocation as soon as all of their current set of proxies are running and used by our system. The rationale is to grow the job proxy allocation on sites which have the potential of running more proxies successfully.

In the experiment, the proxy manager at each site is initially allocated 5 job proxies. Figure 11 shows the expanding job proxy allocation at each of the four sites over time (at 5-minute time step intervals). The graph shows the proxy manager at SDSC acquiring jobs over the course of the experiment, reaching a peak of 14 job proxies. The proxy manager at TACC also acquired some additional job proxies, reaching a peak of 9 allocated job proxies: 1 pending and 8 running job proxies of 32 processors each. Interestingly, this is the 256 processor per-user limit set in the local batch queuing system at TACC.

Figure 11 - Job proxies assigned to sites over time with strategy B.

Figure 12 shows the number of job proxies in the PEND state during the experiment. The number of job proxies in the PEND state fluctuated between 4 and 7 over the course of the experiment, but this is an improvement over strategy A.

Figure 12 - Number of PEND job proxies over time with strategy B.

The CPU profile of the personal cluster generated over time during the experiment is shown in Figure 13. The pool grows from an initial size of 4 processors to over 8 processors. The average size is shown to be 744 processors.

Figure 13 - CPUs provisioned to the personal cluster over time with strategy B.
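Strategy B's expansion rule reduces to a check each proxy manager can run locally; a sketch, with all attribute names and the submission limit being illustrative assumptions:

```python
def maybe_expand_allocation(site) -> None:
    """Strategy B in miniature: a proxy manager asks for one more job proxy only
    when every proxy it currently holds is running and executing user jobs."""
    all_busy = bool(site.proxies) and all(
        proxy.state == "RUN" and proxy.assigned_user_jobs > 0
        for proxy in site.proxies
    )
    if all_busy and len(site.proxies) < site.local_submission_limit:
        # Translated into the local batch submission syntax by the submission agent.
        site.submit_job_proxy(cpus=site.proxy_cpu_count)
```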

D. Strategy C: Sharing a fixed allocation between sites

In this strategy, the total number of CPUs requested by the job proxies is the same as the number of user jobs submitted. The initial job proxy allocation is divided equally across the proxy managers on all the participating sites. As time progresses, the proxy managers adjust their proxy allocation by migrating job proxies in the PEND state to sites where all proxies are currently running in the local system. The rationale is to migrate job proxies to sites with shorter queue wait times, adjusting the initial job proxy allocation to maximize the user job throughput.

Figure 14 shows the UML sequence diagram for the job proxy sharing strategy. When all the job proxies at a site are in the RUN state, the proxy manager can inform the agent manager that it is ready to accept more proxies. The central agent manager may then give it a proxy from its stash (containing proxies it may have acquired when a proxy manager shuts down) or from another site with available job proxies in the PEND state.

Figure 14 - UML sequence diagram of the job proxy sharing strategy between sites.

Figure 15 shows the total number of job proxies allocated to each of the four clusters during the experiment. This time the graph shows the proxy manager at NCSA pulling job proxies over the course of the experiment, reaching a peak allocation of 14 job proxies. The proxy managers at SDSC and TACC also acquired additional job proxies but gave them up when they decided the additional job proxies could not be run. The proxy manager at CACR is seen to give up most of its job proxies, resulting in only 3 allocated job proxies over time, probably due to a higher local job load at CACR during this experiment.

Figure 15 - Job proxies assigned to sites over time with strategy C.

Figure 16 shows the number of job proxies in the PEND state during the experiment's lifetime. The number of job proxies in the PEND state is dramatically reduced with this strategy: during the peak of the experiment all the job proxies were running. The increase in job proxies in the PEND state close to the end of the experiment is explained by the fewer jobs remaining in the system, resulting in job proxies being terminated and resubmitted back into the cluster in the PEND state.

Figure 16 - Number of PEND job proxies over time with strategy C.

The CPU profile of the personal cluster generated over the course of the experiment is shown in Figure 17. The average cluster size was 952 processors.

Figure 17 - CPUs provisioned to the personal cluster over time with strategy C.
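The sharing decision in strategy C can be sketched as follows; the stash, the site objects, and their methods are illustrative assumptions, and the real exchange follows the sequence in Figure 14:

```python
def share_proxies(agent_stash: list, sites: list) -> None:
    """Strategy C in miniature: a site whose proxies are all in the RUN state is
    handed another proxy, taken from the agent manager's stash if possible, or
    otherwise migrated from a site that still has proxies waiting in PEND."""
    for site in sites:
        ready = bool(site.proxies) and all(p.state == "RUN" for p in site.proxies)
        if not ready:
            continue
        if agent_stash:
            site.accept_proxy(agent_stash.pop())    # reuse a proxy from a departed manager
            continue
        donors = [s for s in sites
                  if s is not site and any(p.state == "PEND" for p in s.proxies)]
        if donors:
            # Take from the site with the most idle (PEND) proxies.
            donor = max(donors, key=lambda s: sum(p.state == "PEND" for p in s.proxies))
            site.accept_proxy(donor.release_pending_proxy())   # PEND -> RELEASE -> migrate
```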

E. Experiment Turnaround Times

The turnaround times of the experiments using the three different proxy allocation strategies are shown in Table 2. We also provide the turnaround time of the same experiment conducted at just one site (TACC), without using our system, as a reference. The turnaround time is defined as the time from first job submission to the completion of the 1 th job in our experiment. All results are derived from four experimental runs each. The single-site run was conducted a week after the cross-site runs through the system were completed, hence the prevailing load conditions at each site may have changed in the interim. However, we feel a comparison between the results is still instructive.

Strategy A performs the best, as expected. Strategy C compares well with the overly greedy strategy A. Strategy B performs the worst amongst the three strategies. However, all three strategies provide a 4%-6% improvement in experiment turnaround times compared to just submitting the user jobs to a single site. Our system allows users to pick any of the three strategies studied in this section. However, strategy C (sharing a fixed allocation between sites) provides the best space versus job turnaround time trade-off. Strategy A is an acceptable strategy if sites are not impacted by the system queuing many, possibly unneeded, job proxies in the local scheduler. Where a large number of queued job proxies can affect the local schedulers, or have a psychological impact on other users of the system, strategy C is recommended. Strategy B could possibly perform better with jobs of different run times, but we can assume strategy C to be appropriate in those scenarios too.

Table 2 - MAX/MIN/MEAN of experiment turnaround times (seconds) using the different strategies and running at a single site (TACC). Columns: Strategy A, Strategy B, Strategy C, Single Site (TACC); rows: MIN, MAX, MEAN.

VI. CONCLUSION AND FUTURE WORK

The Condor version of the system is currently in production deployment on the NSF TeraGrid [13]. Up to 1, jobs have been submitted through the system to date, mainly through the Caltech CMS group and the NVO project, enabling approximately 9 teraflops of scientific computation. The system enables users to benefit from a learn-once-run-anywhere submission environment, and resource providers benefit by not needing to accommodate disruptive changes in their local cluster configuration. Future versions of the system will support other workload management interfaces, like PBS. Interesting open problems also remain, such as how the system can automatically determine and adjust the job proxy CPU size request to optimize personal cluster creation times and reduce resource wastage, and how parallel job ensembles can be supported, especially across sites with non-uniform interconnect latencies. Finally, the scheduling technologies developed by the Pegasus/Condor-G, Nimrod/G and APST systems can be leveraged in the future to intelligently schedule job proxies across the TeraGrid. We leave these topics as potential areas for future work.

VII. ACKNOWLEDGMENT

We would like to thank the anonymous reviewers for their helpful and constructive comments. We would also like to thank the TeraGrid Grid Infrastructure Group (GIG), and the resource providers' system administration staff, for useful feedback and help in getting the system into production.

REFERENCES

[1] The NSF TeraGrid.
[2] E. Walker, J. P. Gardner, et al., TeraGrid Scheduling Requirement Analysis Team Final Report.
[3] The Compact Muon Solenoid Experiment.
[4] NSF National Virtual Observatory TeraGrid Utilization Proposal to NRAC, 24.
[5] I. Foster and C. Kesselman, Globus: A Metacomputing Infrastructure Toolkit, International Journal of Supercomputing Applications, 11(2).
[6] Globus Toolkit.
[7] Grid Resource Allocation and Management (GRAM) component.
[8] UNICORE Forum.
[9] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke, Condor-G: A Computation Management Agent for Multi-Institutional Grids, in Proceedings of the 1 th IEEE International Symposium on High Performance Distributed Computing, San Francisco, California, August 21.
[10] R. Buyya, D. Abramson, and J. Giddy, Nimrod/G: An Architecture of a Resource Management and Scheduling System in a Global Computational Grid, in Proceedings of HPC Asia, May 2.
[11] H. Casanova, G. Obertelli, F. Berman, and R. Wolski, The AppLeS Parameter Sweep Template: User-Level Middleware for the Grid, in Proceedings of Supercomputing (SC), Nov 2.
[12] TeraGrid site scheduling policies.
[13] TeraGrid GridShell/Condor System.
[14] Condor, High Throughput Computing Environment.
[15] M. Litzkow, M. Livny, and M. Mutka, Condor - A Hunter of Idle Workstations, in Proceedings of the International Conference on Distributed Computing Systems, June.
[16] GridShell.
[17] E. Walker and T. Minyard, Orchestrating and Coordinating Scientific/Engineering Workflows using GridShell, in Proceedings of the 13 th IEEE International Symposium on High Performance Distributed Computing, Honolulu, Hawaii, June 24.
[18] Portable Batch System.
[19] Sun Grid Engine.
[20] The Globus Alliance, Overview of the Grid Security Infrastructure.
[21] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Blackburn, A. Lazzarini, A. Arbree, R. Cavanaugh, and S. Koranda, Mapping Abstract Complex Workflows onto Grid Environments, Journal of Grid Computing, 1(1):25-29.


More information

Scalable Computing: Practice and Experience Volume 10, Number 4, pp

Scalable Computing: Practice and Experience Volume 10, Number 4, pp Scalable Computing: Practice and Experience Volume 10, Number 4, pp. 413 418. http://www.scpe.org ISSN 1895-1767 c 2009 SCPE MULTI-APPLICATION BAG OF JOBS FOR INTERACTIVE AND ON-DEMAND COMPUTING BRANKO

More information

Migrating to the P8 5.2 Component Manager Framework

Migrating to the P8 5.2 Component Manager Framework Migrating to the P8 5.2 Component Manager Framework Contents Migrating to the P8 5.2 Component Manager Framework... 1 Introduction... 1 Revision History:... 2 Comparing the Two Component Manager Frameworks...

More information

Implementing data aware scheduling in Gfarm R using LSF TM scheduler plugin mechanism

Implementing data aware scheduling in Gfarm R using LSF TM scheduler plugin mechanism Implementing data aware scheduling in Gfarm R using LSF TM scheduler plugin mechanism WEI Xiaohui 1,Wilfred W. LI 2, Osamu TATEBE 3, XU Gaochao 1,HU Liang 1, JU Jiubin 1 (1: College of Computer Science

More information

Condor and BOINC. Distributed and Volunteer Computing. Presented by Adam Bazinet

Condor and BOINC. Distributed and Volunteer Computing. Presented by Adam Bazinet Condor and BOINC Distributed and Volunteer Computing Presented by Adam Bazinet Condor Developed at the University of Wisconsin-Madison Condor is aimed at High Throughput Computing (HTC) on collections

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

HETEROGENEOUS COMPUTING

HETEROGENEOUS COMPUTING HETEROGENEOUS COMPUTING Shoukat Ali, Tracy D. Braun, Howard Jay Siegel, and Anthony A. Maciejewski School of Electrical and Computer Engineering, Purdue University Heterogeneous computing is a set of techniques

More information

Office and Express Print Submission High Availability for DRE Setup Guide

Office and Express Print Submission High Availability for DRE Setup Guide Office and Express Print Submission High Availability for DRE Setup Guide Version 1.0 2016 EQ-HA-DRE-20160915 Print Submission High Availability for DRE Setup Guide Document Revision History Revision Date

More information

Grid Programming: Concepts and Challenges. Michael Rokitka CSE510B 10/2007

Grid Programming: Concepts and Challenges. Michael Rokitka CSE510B 10/2007 Grid Programming: Concepts and Challenges Michael Rokitka SUNY@Buffalo CSE510B 10/2007 Issues Due to Heterogeneous Hardware level Environment Different architectures, chipsets, execution speeds Software

More information

Introduction to Grid Technology

Introduction to Grid Technology Introduction to Grid Technology B.Ramamurthy 1 Arthur C Clarke s Laws (two of many) Any sufficiently advanced technology is indistinguishable from magic." "The only way of discovering the limits of the

More information

A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme

A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme Yue Zhang 1 and Yunxia Pei 2 1 Department of Math and Computer Science Center of Network, Henan Police College, Zhengzhou,

More information

Globus Toolkit 4 Execution Management. Alexandra Jimborean International School of Informatics Hagenberg, 2009

Globus Toolkit 4 Execution Management. Alexandra Jimborean International School of Informatics Hagenberg, 2009 Globus Toolkit 4 Execution Management Alexandra Jimborean International School of Informatics Hagenberg, 2009 2 Agenda of the day Introduction to Globus Toolkit and GRAM Zoom In WS GRAM Usage Guide Architecture

More information

OAR batch scheduler and scheduling on Grid'5000

OAR batch scheduler and scheduling on Grid'5000 http://oar.imag.fr OAR batch scheduler and scheduling on Grid'5000 Olivier Richard (UJF/INRIA) joint work with Nicolas Capit, Georges Da Costa, Yiannis Georgiou, Guillaume Huard, Cyrille Martin, Gregory

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Belay, A. et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Reviewed by Chun-Yu and Xinghao Li Summary In this

More information

Framework for Interactive Parallel Dataset Analysis on the Grid

Framework for Interactive Parallel Dataset Analysis on the Grid SLAC-PUB-12289 January 2007 Framework for Interactive Parallel Analysis on the David A. Alexander, Balamurali Ananthan Tech-X Corporation 5621 Arapahoe Ave, Suite A Boulder, CO 80303 {alexanda,bala}@txcorp.com

More information

Grid Computing Security hack.lu 2006 :: Security in Grid Computing :: Lisa Thalheim 1

Grid Computing Security hack.lu 2006 :: Security in Grid Computing :: Lisa Thalheim 1 Grid Computing Security 20.10.2006 hack.lu 2006 :: Security in Grid Computing :: Lisa Thalheim 1 What to expect from this talk Collection of bits about GC I thought you might find interesting Mixed bag:

More information

OmniRPC: a Grid RPC facility for Cluster and Global Computing in OpenMP

OmniRPC: a Grid RPC facility for Cluster and Global Computing in OpenMP OmniRPC: a Grid RPC facility for Cluster and Global Computing in OpenMP (extended abstract) Mitsuhisa Sato 1, Motonari Hirano 2, Yoshio Tanaka 2 and Satoshi Sekiguchi 2 1 Real World Computing Partnership,

More information

PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM

PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM Szabolcs Pota 1, Gergely Sipos 2, Zoltan Juhasz 1,3 and Peter Kacsuk 2 1 Department of Information Systems, University of Veszprem, Hungary 2 Laboratory

More information

Managing CAE Simulation Workloads in Cluster Environments

Managing CAE Simulation Workloads in Cluster Environments Managing CAE Simulation Workloads in Cluster Environments Michael Humphrey V.P. Enterprise Computing Altair Engineering humphrey@altair.com June 2003 Copyright 2003 Altair Engineering, Inc. All rights

More information

Pegasus Workflow Management System. Gideon Juve. USC Informa3on Sciences Ins3tute

Pegasus Workflow Management System. Gideon Juve. USC Informa3on Sciences Ins3tute Pegasus Workflow Management System Gideon Juve USC Informa3on Sciences Ins3tute Scientific Workflows Orchestrate complex, multi-stage scientific computations Often expressed as directed acyclic graphs

More information

Configuring iscsi in a VMware ESX Server 3 Environment B E S T P R A C T I C E S

Configuring iscsi in a VMware ESX Server 3 Environment B E S T P R A C T I C E S Configuring iscsi in a VMware ESX Server 3 Environment B E S T P R A C T I C E S Contents Introduction...1 iscsi Explained...1 Initiators...1 Discovery and Logging On...2 Authentication...2 Designing the

More information

Day 1 : August (Thursday) An overview of Globus Toolkit 2.4

Day 1 : August (Thursday) An overview of Globus Toolkit 2.4 An Overview of Grid Computing Workshop Day 1 : August 05 2004 (Thursday) An overview of Globus Toolkit 2.4 By CDAC Experts Contact :vcvrao@cdacindia.com; betatest@cdacindia.com URL : http://www.cs.umn.edu/~vcvrao

More information

CSF4:A WSRF Compliant Meta-Scheduler

CSF4:A WSRF Compliant Meta-Scheduler CSF4:A WSRF Compliant Meta-Scheduler Wei Xiaohui 1, Ding Zhaohui 1, Yuan Shutao 2, Hou Chang 1, LI Huizhen 1 (1: The College of Computer Science & Technology, Jilin University, China 2:Platform Computing,

More information

An update on the scalability limits of the Condor batch system

An update on the scalability limits of the Condor batch system An update on the scalability limits of the Condor batch system D Bradley 1, T St Clair 1, M Farrellee 1, Z Guo 1, M Livny 1, I Sfiligoi 2, T Tannenbaum 1 1 University of Wisconsin, Madison, WI, USA 2 University

More information

Mapping Abstract Complex Workflows onto Grid Environments

Mapping Abstract Complex Workflows onto Grid Environments Mapping Abstract Complex Workflows onto Grid Environments Ewa Deelman, James Blythe, Yolanda Gil, Carl Kesselman, Gaurang Mehta, Karan Vahi Information Sciences Institute University of Southern California

More information

HEP replica management

HEP replica management Primary actor Goal in context Scope Level Stakeholders and interests Precondition Minimal guarantees Success guarantees Trigger Technology and data variations Priority Releases Response time Frequency

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

ISTITUTO NAZIONALE DI FISICA NUCLEARE

ISTITUTO NAZIONALE DI FISICA NUCLEARE ISTITUTO NAZIONALE DI FISICA NUCLEARE Sezione di Perugia INFN/TC-05/10 July 4, 2005 DESIGN, IMPLEMENTATION AND CONFIGURATION OF A GRID SITE WITH A PRIVATE NETWORK ARCHITECTURE Leonello Servoli 1,2!, Mirko

More information

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

Citrix Connector Citrix Systems, Inc. All rights reserved. p.1. About this release. System requirements. Technical overview.

Citrix Connector Citrix Systems, Inc. All rights reserved. p.1. About this release. System requirements. Technical overview. Citrix Connector 3.1 May 02, 2016 About this release System requirements Technical overview Plan Install Citrix Connector Upgrade Create applications Deploy applications to machine catalogs Publish applications

More information

Advanced School in High Performance and GRID Computing November Introduction to Grid computing.

Advanced School in High Performance and GRID Computing November Introduction to Grid computing. 1967-14 Advanced School in High Performance and GRID Computing 3-14 November 2008 Introduction to Grid computing. TAFFONI Giuliano Osservatorio Astronomico di Trieste/INAF Via G.B. Tiepolo 11 34131 Trieste

More information

Cloud Computing. Up until now

Cloud Computing. Up until now Cloud Computing Lecture 4 and 5 Grid: 2012-2013 Introduction. Up until now Definition of Cloud Computing. Grid Computing: Schedulers: Condor SGE 1 Summary Core Grid: Toolkit Condor-G Grid: Conceptual Architecture

More information

MONTE CARLO SIMULATION FOR RADIOTHERAPY IN A DISTRIBUTED COMPUTING ENVIRONMENT

MONTE CARLO SIMULATION FOR RADIOTHERAPY IN A DISTRIBUTED COMPUTING ENVIRONMENT The Monte Carlo Method: Versatility Unbounded in a Dynamic Computing World Chattanooga, Tennessee, April 17-21, 2005, on CD-ROM, American Nuclear Society, LaGrange Park, IL (2005) MONTE CARLO SIMULATION

More information

MCTS Guide to Microsoft Windows Server 2008 Applications Infrastructure Configuration (Exam # ) Chapter One Introducing Windows Server 2008

MCTS Guide to Microsoft Windows Server 2008 Applications Infrastructure Configuration (Exam # ) Chapter One Introducing Windows Server 2008 MCTS Guide to Microsoft Windows Server 2008 Applications Infrastructure Configuration (Exam # 70-643) Chapter One Introducing Windows Server 2008 Objectives Distinguish among the different Windows Server

More information

A Compact Computing Environment For A Windows PC Cluster Towards Seamless Molecular Dynamics Simulations

A Compact Computing Environment For A Windows PC Cluster Towards Seamless Molecular Dynamics Simulations A Compact Computing Environment For A Windows PC Cluster Towards Seamless Molecular Dynamics Simulations Yuichi Tsujita Abstract A Windows PC cluster is focused for its high availabilities and fruitful

More information

DiPerF: automated DIstributed PERformance testing Framework

DiPerF: automated DIstributed PERformance testing Framework DiPerF: automated DIstributed PERformance testing Framework Ioan Raicu, Catalin Dumitrescu, Matei Ripeanu, Ian Foster Distributed Systems Laboratory Computer Science Department University of Chicago Introduction

More information

Functional Requirements for Grid Oriented Optical Networks

Functional Requirements for Grid Oriented Optical Networks Functional Requirements for Grid Oriented Optical s Luca Valcarenghi Internal Workshop 4 on Photonic s and Technologies Scuola Superiore Sant Anna Pisa June 3-4, 2003 1 Motivations Grid networking connection

More information

Best of Breed Surveillance System SOLUTION WHITEPAPER

Best of Breed Surveillance System SOLUTION WHITEPAPER Best of Breed Surveillance System SOLUTION WHITEPAPER Document Version 1.0 August 2018 Introduction The effectiveness of your surveillance system relies on a solid foundation of best of breed technologies.

More information

Work Queue + Python. A Framework For Scalable Scientific Ensemble Applications

Work Queue + Python. A Framework For Scalable Scientific Ensemble Applications Work Queue + Python A Framework For Scalable Scientific Ensemble Applications Peter Bui, Dinesh Rajan, Badi Abdul-Wahid, Jesus Izaguirre, Douglas Thain University of Notre Dame Distributed Computing Examples

More information

S i m p l i f y i n g A d m i n i s t r a t i o n a n d M a n a g e m e n t P r o c e s s e s i n t h e P o l i s h N a t i o n a l C l u s t e r

S i m p l i f y i n g A d m i n i s t r a t i o n a n d M a n a g e m e n t P r o c e s s e s i n t h e P o l i s h N a t i o n a l C l u s t e r S i m p l i f y i n g A d m i n i s t r a t i o n a n d M a n a g e m e n t P r o c e s s e s i n t h e P o l i s h N a t i o n a l C l u s t e r Miroslaw Kupczyk, Norbert Meyer, Pawel Wolniewicz e-mail:

More information

Inca as Monitoring. Kavin Kumar Palanisamy Indiana University Bloomington

Inca as Monitoring. Kavin Kumar Palanisamy Indiana University Bloomington Inca as Monitoring Kavin Kumar Palanisamy Indiana University Bloomington Abstract Grids are built with multiple complex and interdependent systems to provide better resources. It is necessary that the

More information