Resource Allocation in Computational Grids
1 Resource Allocation in Computational Grids. Riccardo Murri, Grid Computing Competence Center, Organisch-Chemisches Institut, University of Zurich. Nov. 23, 21
2 Scheduling on a cluster. [Diagram: a user connects over the internet via ssh username@server to the batch system server, which controls compute nodes 1..N over a local 1Gb/s ethernet network.] All job requests are sent to a central server. The server decides which job runs where and when. Grid resource allocation R. Murri, Large Scale Computing Infrastructures, Nov. 23, 21
3 where: resource allocation model. Computing resources are defined by a structured set of attributes (key=value pairs). SGE's default configuration defines 53 such attributes: number of available cores/CPUs; total size of RAM/swap; current load average; etc. A node is eligible for running a job iff the node attributes are compatible with the job resource requirements. (Other batch systems are similar.)
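The eligibility rule can be sketched in a few lines of Python. This is only an illustration of the idea, not SGE's actual matching code, and the attribute names are invented for the example:

```python
# A node is eligible iff every job requirement (a predicate over one
# node attribute) is satisfied. Attribute names are hypothetical.

def eligible(node_attrs, job_reqs):
    """Return True iff every (key, predicate) requirement holds."""
    return all(key in node_attrs and pred(node_attrs[key])
               for key, pred in job_reqs.items())

node = {"num_proc": 8, "mem_total_gb": 32, "load_avg": 0.7}
job = {"num_proc": lambda v: v >= 4,      # need at least 4 cores
       "mem_total_gb": lambda v: v >= 16}  # and at least 16 GB RAM

print(eligible(node, job))  # True: the node satisfies both requirements
```

A requirement on an attribute the node does not advertise simply never matches, which is the same behavior the scratch-space example later in these slides runs into.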
4 when: scheduling policy. There are usually more jobs than the system can handle concurrently. (Even more so in the high-throughput computing cases we are interested in.) So job requests must be prioritized. Prioritization of requests is a matter of the local scheduling policy. (And this differs greatly among batch systems and among sites.)
5 (Hidden) assumptions. 1. The scheduling server has complete knowledge of the nodes: local networks have low latency (average RTT 0.3 ms on 1Gb/s ethernet) and the status information is a small packet. 2. The server has complete control over the nodes: a compute node will immediately execute a job when told to by the server.
6 How does this extend to Grid computing? By definition of a Grid: 1. It is geographically distributed: high-latency links (hence resource status may not be up-to-date); the network is easily partitioned and nodes disconnected (hence resources have a dynamic nature: they may come and go). 2. Resources come from multiple control domains: prioritization is a matter of local policy! AuthZ and other issues may prevent execution at all.
7 The Globus/ARC model. [Diagram: the arcsub/arcstat/arcget client talks over the internet to several independent batch system servers, each controlling compute nodes 1..N over its own local 1Gb/s ethernet network.] An infrastructure is a set of independent clusters. The client host selects one cluster and submits a job there, then periodically polls for status information.
8 Issues in the Globus/ARC approach? 1. How to select a good execution site? 2. How to gather the required information from the sites? 3. Based on the same information, two clients can arrive at the same scheduling decision, hence they can flood a site with jobs. 4. Actual job start times are unpredictable, as scheduling is ultimately a local decision. 5. Client polling increases the load linearly with the number of jobs.
9 The MDS InfoSystem, I. [Diagram: as in the previous slide, but each cluster's batch system server now runs a GRIS, which the arcsub/arcstat/arcget client queries over the internet.] The Globus Monitoring and Discovery Service.
10 The MDS InfoSystem, II. A specialized service provides information about site status. Each site reports its information to a local database (GRIS). Each GRIS registers with a global indexing service (GIIS). The client talks to the GIIS to get the list of sites, and then queries each GRIS for the site-specific information.
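The two-step lookup can be modeled with plain data structures (contents invented for illustration; the real GIIS and GRIS services speak LDAP):

```python
# Step 1: the GIIS only knows *which* sites exist.
# Step 2: each site's GRIS holds the site-specific details.
giis = ["grisA", "grisB"]  # global index: the registered sites
gris = {"grisA": {"cpus": 224, "queued": 4},
        "grisB": {"cpus": 64,  "queued": 0}}

def discover():
    """Ask the index for the site list, then query every site."""
    return {site: gris[site] for site in giis}

print(discover()["grisB"]["cpus"])  # 64
```

Note that the client must contact every GRIS itself; this is exactly why the cost of gathering information grows with the number of sites.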
11 LDAP. The protocol underlying MDS is called LDAP. LDAP allows remote read/write access to a distributed database (the X.500 directory system), with a flexible authentication and authorization scheme. LDAP makes the assumption that most accesses are reads, so LDAP servers are optimized for infrequent writes. Reference: A. S. Tanenbaum, Computer Networks, ISBN
12 LDAP schemas. Entries in an LDAP database are sets of key/value pairs. (Keys need not be unique; equivalently, a key can map to multiple values.) An LDAP schema specifies the names of the allowed keys and the types of the corresponding values. Each entry declares a set of schemas it conforms to; every attribute in an LDAP entry must be defined in some schema.
13 X.500/LDAP Directories. Entries are organized into a tree structure (DIT). (So LDAP queries return subtrees, as opposed to the flat sets of rows of an RDBMS query.) Each entry is uniquely identified by a Distinguished Name (DN). The DN of an entry is formed by prefixing one or more attribute values to the parent entry's DN. LDAP accesses might result in referrals, which redirect the client to another entry at a remote server.
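DN construction can be sketched directly from this rule. The helper below is hypothetical (not part of any LDAP library); the attribute names come from the ARC example on the next slide:

```python
# Build a child's DN by prefixing its naming attribute(s) (the RDN)
# to the parent's DN, comma-separated.

def child_dn(parent_dn, **naming_attrs):
    rdn = "+".join(f"{k}={v}" for k, v in naming_attrs.items())
    return f"{rdn},{parent_dn}" if parent_dn else rdn

root = child_dn("", o="grid")
vo = child_dn(root, **{"mds-vo-name": "switzerland"})
cluster = child_dn(vo, **{"nordugrid-cluster-name": "gordias.unige.ch"})
print(cluster)
# nordugrid-cluster-name=gordias.unige.ch,mds-vo-name=switzerland,o=grid
```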
14 Example. This is how the ARC MDS represents information about a cluster queue in LDAP:
# all.q, gordias.unige.ch, Switzerland, grid
dn: nordugrid-queue-name=all.q,nordugrid-cluster-name=gordias.unige.ch,mds-vo-name=switzerland,o=grid
objectclass: Mds
objectclass: nordugrid-queue
nordugrid-queue-name: all.q
nordugrid-queue-status: active
nordugrid-queue-comment: sge default queue
nordugrid-queue-homogeneity: TRUE
nordugrid-queue-nodecpu: Xeon 2800 MHz
nordugrid-queue-nodememory: 2048
nordugrid-queue-architecture: x86_64
nordugrid-queue-opsys: ScientificLinux-5.5
nordugrid-queue-totalcpus: 224
nordugrid-queue-gridqueued: 0
nordugrid-queue-prelrmsqueued: 4
nordugrid-queue-gridrunning: 0
nordugrid-queue-running: 0
nordugrid-queue-maxrunning: 136
nordugrid-queue-localqueued: 4
15 Based on the information in the previous slide, can you decide whether to send a job that requires 200GB of scratch space to this cluster?
16 The MDS cluster model. Exactly: there's no way to make that decision. ARC (and Globus) only provide CPU/RAM/architecture information. In addition, they assume clusters are organized into homogeneous queues, which might not be the case. This is just an example of a more general problem: what information do we need about a remote cluster, and how do we represent it? Reference: B. Kónya, The ARC Information System, infosys.pdf
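The point is easy to see by parsing the queue entry above into a dictionary (multi-valued keys collect into lists; only a few attributes are reproduced here) and then looking for a scratch-space attribute, which the schema simply does not define:

```python
LDIF = """\
nordugrid-queue-name: all.q
nordugrid-queue-totalcpus: 224
nordugrid-queue-nodememory: 2048
nordugrid-queue-architecture: x86_64
"""

def parse_ldif(text):
    """Parse 'key: value' lines into a dict of value lists."""
    entry = {}
    for line in text.splitlines():
        key, _, value = line.partition(": ")
        entry.setdefault(key, []).append(value)
    return entry

entry = parse_ldif(LDIF)
print(entry.get("nordugrid-queue-nodememory"))    # ['2048'] -- RAM is there
print(entry.get("nordugrid-queue-scratchspace"))  # None -- no such attribute
```

(The attribute name `nordugrid-queue-scratchspace` is made up to illustrate the gap: no attribute in the queue entry carries scratch-space information.)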
17 MDS performance. The complete LDAP tree of the SMSCG grid counts over entries. A full dump of the SMSCG infosystem tree requires about 30 seconds. So: 1. information is several seconds old (on average); 2. it does not make sense to refresh information more often than this. By default, ARC refreshes the infosystem every 60 seconds.
18 Supported and unsupported use cases, I. Pre-installed application: OK. The ARC InfoSys has a generic mechanism ("run time environments") for providing information about installed software. So you can select only sites that provide the application you want. (And the information provided in the InfoSys is usually enough to make a good guess about the overall performance.)
19 Supported and unsupported use cases, II. Single-thread CPU-intensive native binary: OK. However, the binary must not require unusual dynamic libraries, and it cannot use CPU-specific features (there is no information on the CPU model, so you cannot broker on that).
20 Supported and unsupported use cases, III. Java/Python/Ruby/R script: requires brokering based on a large number of support libraries/packages; if the dependencies are not there, the program cannot run. In theory, run time environments solve this issue. In practice, there is always less information available than would be useful, and providing all the information that would be useful is too much work. Ultimately, it relies on convention and good practice.
21 Supported and unsupported use cases, IV. Code benchmarking: FAIL. Benchmarking code requires running all cases under the same conditions. There is just no way to guarantee that with the federation-of-clusters model: e.g., the site batch scheduler may run two jobs on compute nodes with different CPUs.
22 Supported and unsupported use cases, V. Parallel jobs: FAIL. You can request a certain number of CPUs, but you have no information and no control over: CPU/thread allocation (all slots in a single large SMP machine? slots distributed evenly across nodes?); communication mechanism (which MPI library is used? which transport fabric?). (In theory, this can be solved by a careful choice of run time environments. In practice, it means that everybody has to agree on how to represent that information, so it just replicates the schema problem.)
23 ARC: Pros and Cons. Pros: very simple to deploy, easy to extend; system and code complexity still manageable. Cons: the burden for scaling up is on each site, but not all sites have the required know-how/resources; the complexity of managing large collections of jobs is on the client software side; the fixed infosystem schema does not accommodate certain use cases.
24 The glite approach. [Diagram: the glite job submit client talks to the WMS, which queries a top BDII; each site runs a site BDII and a batch system server controlling compute nodes 1..N over a local 1Gb/s ethernet network.] Reference: content/article/51-generaltechdocs/57-archoverview
25 The glite WMS. Server-centric architecture: all jobs are submitted to the WMS server. The WMS inspects the Grid status, makes the scheduling decision and submits jobs to sites. The WMS also monitors jobs as they run, and fetches back the output when a job is done. The client polls the WMS, and when a job is done gets the output from the WMS.
26 The glite infosystem, I. Hierarchical architecture, based on LDAP: 1. each Grid element runs its own LDAP server (resource BDII) providing information on the software status and capabilities; 2. a site BDII polls the local element servers and aggregates the information into a site view; 3. a top BDII polls the site BDIIs and aggregates the information into a global view. Each step requires processing the collected entries and creating a new LDIF tree based on the new information.
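The three-level aggregation can be sketched as follows (data and naming scheme invented for illustration; real BDIIs exchange LDIF over LDAP):

```python
# resource BDIIs: one per Grid element, keyed here by "element.site"
resource_bdiis = {
    "ce01.siteA": {"state": "Production", "free_slots": 12},
    "se01.siteA": {"state": "Production", "free_tb": 40},
    "ce01.siteB": {"state": "Draining",   "free_slots": 0},
}

def site_view(site):
    """A site BDII polls the resource BDIIs of its own elements."""
    return {name: info for name, info in resource_bdiis.items()
            if name.endswith(site)}

def top_view(sites):
    """The top BDII polls every site BDII and merges the results."""
    view = {}
    for s in sites:
        view.update(site_view(s))
    return view

grid = top_view(["siteA", "siteB"])
print(len(grid))  # 3 entries in the global view
```

Each level rebuilds its view from whatever it managed to poll, which is why a slow or unreachable site directly delays the top-level refresh.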
27 The glite infosystem, II. The CREAM computing element at CSCS has 43 entries in its resource BDII; listing them takes 0.5 seconds. The CSCS site BDII has 191 entries; listing them takes 0.5 seconds. The CERN top BDII has > entries, collected from circa 200 sites; listing them all takes over 2 minutes.
28 The GLUE schema. The glite information system represents system status based on the GLUE schema. (Version 1.3 is currently being phased out in favor of v. 2.0.) It is a comprehensive and complex schema: 1. aimed at interoperability among Grid providers; 2. an attempt to cover every feature supported by the major middlewares and production infrastructures (esp. HEP); 3. heavy use of cross-entry references. It can accommodate the scratch space example, but there's still no way of figuring out whether (and how) a job can request 16 cores on the same physical node.
29 Comparison with ARC's InfoSystem. ARC stores information about jobs and users in the infosystem: this results in a relatively large number of entries, so the ARC infosys cannot scale to a large high-throughput infrastructure. However, glite's BDII puts a large load on the top BDII: it must handle the load from all clients, and it must be able to poll all site BDIIs in a fixed time, so it must cope with network timeouts, slow sites, etc.
30 glite WMS: Pros and Cons. Pros: global view of the Grid, so it could take better meta-scheduling decisions; can support aggregate job types (e.g., workflows); aggregates the monitoring operations, so it reduces the load on sites. Cons: the WMS is a single point of failure; clients still use a polling mechanism, so the WMS must sustain the load; it is an extremely complex piece of software running on a single machine: very hard to scale up! It relies on an infosystem to take sensible decisions (the fixed schema/representation problem).
31 Condor. [Diagram: condor_submit sends jobs through a condor_agent to the condor_master; a condor_resource runs in front of each cluster's batch system server and its compute nodes 1..N on a local 1Gb/s ethernet network.]
32 Condor overview. Agents (client-side software) and Resources (cluster-side software) advertise their requests and capabilities to the Condor Master. The Master performs match-making between the Agents' requests and the Resources' offerings. An Agent sends its computational job directly to the matching Resource. Reference: Thain, D., Tannenbaum, T. and Livny, M. (2005): Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 17.
33 What is matchmaking?
34 Matchmaking, I. Condor uses the same idea, except the schema is not fixed. Agents and Resources report their requests and offers using the ClassAd format (an enriched key=value format). There is no prescribed schema, hence a Resource is free to advertise any interesting feature it has, and to represent it in any way that fits the key=value model.
35 Matchmaking, II. 1. Agents specify a Requirements constraint: a boolean expression that can use any value from the Agent's own (self) ClassAd or the Resource's (other). 2a. Resources whose offered ClassAd does not satisfy the Requirements constraint are discarded. 2b. Conversely, if the Agent's ClassAd does not satisfy the Resource's Requirements, the Resource is discarded. 3. The surviving Resources are sorted according to the value of the Rank expression in the Agent's ClassAd, and their list is returned to the Agent.
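Steps 1–3 can be sketched in Python. This is a toy model, not Condor's implementation: ClassAds are plain dicts, `Requirements` is a predicate over (self, other), `Rank` is a scoring function, and all attribute names are invented:

```python
def matchmake(agent, resources):
    # steps 2a/2b: keep only resources where *both* Requirements hold
    mutual = [r for r in resources
              if agent["Requirements"](agent, r)
              and r["Requirements"](r, agent)]
    # step 3: sort survivors by the agent's Rank, best first
    return sorted(mutual, key=lambda r: agent["Rank"](agent, r),
                  reverse=True)

agent = {
    "Owner": "alice",
    "Requirements": lambda self, other: other["Arch"] == "x86_64",
    "Rank": lambda self, other: other["Memory"],  # prefer more RAM
}
big = {"Name": "big", "Arch": "x86_64", "Memory": 64,
       "Requirements": lambda self, other: other["Owner"] != "rival"}
small = {"Name": "small", "Arch": "x86_64", "Memory": 8,
         "Requirements": lambda self, other: True}
arm = {"Name": "arm", "Arch": "aarch64", "Memory": 128,
       "Requirements": lambda self, other: True}

ranked = matchmake(agent, [small, arm, big])
print([r["Name"] for r in ranked])  # ['big', 'small'] -- arm discarded
```

Note the symmetry: the `arm` resource is discarded by the agent's Requirements, and a resource's own Requirements can equally discard an agent.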
36 Example: Job ClassAd. Select 64-bit Linux hosts, and sort them preferring hosts with larger memory and CPU speed:
Requirements = Arch == "x86_64" && OpSys == "LINUX"
Rank = TARGET.Memory + TARGET.Mips
Agent ClassAds play a role similar to job descriptions in ARC/gLite: they specify the compatibility/resource requests. Reference: Condor manual, §4.1, "Condor's ClassAd"
37 Example: Resource ClassAd. A complex access policy, giving priority to users from the owner's research group, then other "friend" users, and then the rest:
Friend = Owner == "tannenba"
ResearchGroup = (Owner == "jbasney" || Owner == "raman")
Trusted = Owner != "rival"
Requirements = Trusted && ( ResearchGroup || LoadAvg < 0.3 && KeyboardIdle > 15*60 )
Rank = Friend + ResearchGroup*10
Resource ClassAds specify an access/usage policy for the resource.
38 ClassAd wrap-up. ClassAds provide an extensible mechanism for describing resources and requirements: 1. a set of standard ClassAd values is provided by Condor itself; 2. new values can be defined by the user (both client- and server-side). How can you submit a job that requires 200GB of local scratch space? Or 16 cores in a single node? Providing the right attributes for the match is now an organizational problem, not a technical one.
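Continuing the toy model from the matchmaking sketch: a site can advertise a custom attribute (here the invented name `ScratchGB`) and jobs can match on it, while a site that never advertises it simply never satisfies the requirement:

```python
def satisfies(job_req, resource_ad):
    """Evaluate one job requirement against a resource's ad."""
    return job_req(resource_ad)

# Hypothetical requirement: at least 200 GB of local scratch space.
needs_scratch = lambda ad: ad.get("ScratchGB", 0) >= 200

site_with = {"Name": "siteA", "ScratchGB": 500}
site_without = {"Name": "siteB"}  # attribute never advertised

print(satisfies(needs_scratch, site_with))     # True
print(satisfies(needs_scratch, site_without))  # False
```

The mechanism works; the hard part is getting every site to agree on advertising `ScratchGB` (or whatever the attribute is called) in the first place.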
39 All these job management systems are based on a push model (you send the job to an execution cluster). Is there, conversely, a pull model?
40 References
1. Foster, I. (2002): What is the Grid? A Three Point Checklist. Grid Today, July 20, 2002.
2. Thain, D., Tannenbaum, T. and Livny, M. (2005): Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 17. DOI: /cpe
3. Kónya, B. (20): The ARC Information System, infosys.pdf
4. Cecchi, M. et al. (2009): The gLite Workload Management System. Lecture Notes in Computer Science, 5529/2009.
5. Andreozzi, S. et al. (2009): GLUE Specification v. 2.0
41 Average ping RTT to some SMSCG clusters. [Table of per-cluster times in ms, for: idgc3grid.uzh.ch, hera.wsl.ch, arc.lcg.cscs.ch, smscg.epfl.ch, gordias.unige.ch.]
42 Time to retrieve a single LDAP entry. [Table of per-cluster query and connect times in ms, for: smscg.epfl.ch, gordias.unige.ch, idgc3grid.uzh.ch, arc.lcg.cscs.ch, hera.wsl.ch.]
More informationMERCED CLUSTER BASICS Multi-Environment Research Computer for Exploration and Discovery A Centerpiece for Computational Science at UC Merced
MERCED CLUSTER BASICS Multi-Environment Research Computer for Exploration and Discovery A Centerpiece for Computational Science at UC Merced Sarvani Chadalapaka HPC Administrator University of California
More informationThe Google File System
October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single
More informationGrid Compute Resources and Job Management
Grid Compute Resources and Job Management How do we access the grid? Command line with tools that you'll use Specialised applications Ex: Write a program to process images that sends data to run on the
More informationMI-PDB, MIE-PDB: Advanced Database Systems
MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:
More informationScheduling Large Parametric Modelling Experiments on a Distributed Meta-computer
Scheduling Large Parametric Modelling Experiments on a Distributed Meta-computer David Abramson and Jon Giddy Department of Digital Systems, CRC for Distributed Systems Technology Monash University, Gehrmann
More informationGeant4 on Azure using Docker containers
http://www.geant4.org Geant4 on Azure using Docker containers Andrea Dotti (adotti@slac.stanford.edu) ; SD/EPP/Computing 1 Outlook Motivation/overview Docker + G4 Azure + G4 Conclusions 2 Motivation/overview
More informationA SEMANTIC MATCHMAKER SERVICE ON THE GRID
DERI DIGITAL ENTERPRISE RESEARCH INSTITUTE A SEMANTIC MATCHMAKER SERVICE ON THE GRID Andreas Harth Yu He Hongsuda Tangmunarunkit Stefan Decker Carl Kesselman DERI TECHNICAL REPORT 2004-05-18 MAY 2004 DERI
More informationMultiple Broker Support by Grid Portals* Extended Abstract
1. Introduction Multiple Broker Support by Grid Portals* Extended Abstract Attila Kertesz 1,3, Zoltan Farkas 1,4, Peter Kacsuk 1,4, Tamas Kiss 2,4 1 MTA SZTAKI Computer and Automation Research Institute
More informationCHAPTER 2 LITERATURE REVIEW AND BACKGROUND
8 CHAPTER 2 LITERATURE REVIEW AND BACKGROUND 2.1 LITERATURE REVIEW Several researches have been carried out in Grid Resource Management and some of the existing research works closely related to this thesis
More informationScheduling Jobs onto Intel Xeon Phi using PBS Professional
Scheduling Jobs onto Intel Xeon Phi using PBS Professional Scott Suchyta 1 1 Altair Engineering Inc., 1820 Big Beaver Road, Troy, MI 48083, USA Abstract As new hardware and technology arrives, it is imperative
More informationPoS(EGICF12-EMITC2)074
The ARC Information System: overview of a GLUE2 compliant production system Lund University E-mail: florido.paganelli@hep.lu.se Balázs Kónya Lund University E-mail: balazs.konya@hep.lu.se Oxana Smirnova
More informationTowards sustainability: An interoperability outline for a Regional ARC based infrastructure in the WLCG and EGEE infrastructures
Journal of Physics: Conference Series Towards sustainability: An interoperability outline for a Regional ARC based infrastructure in the WLCG and EGEE infrastructures To cite this article: L Field et al
More informationARC integration for CMS
ARC integration for CMS ARC integration for CMS Erik Edelmann 2, Laurence Field 3, Jaime Frey 4, Michael Grønager 2, Kalle Happonen 1, Daniel Johansson 2, Josva Kleist 2, Jukka Klem 1, Jesper Koivumäki
More informationIntroduction to Grid Computing
Milestone 2 Include the names of the papers You only have a page be selective about what you include Be specific; summarize the authors contributions, not just what the paper is about. You might be able
More informationDIRAC pilot framework and the DIRAC Workload Management System
Journal of Physics: Conference Series DIRAC pilot framework and the DIRAC Workload Management System To cite this article: Adrian Casajus et al 2010 J. Phys.: Conf. Ser. 219 062049 View the article online
More informationWhat s new in HTCondor? What s coming? HTCondor Week 2018 Madison, WI -- May 22, 2018
What s new in HTCondor? What s coming? HTCondor Week 2018 Madison, WI -- May 22, 2018 Todd Tannenbaum Center for High Throughput Computing Department of Computer Sciences University of Wisconsin-Madison
More informationMONTE CARLO SIMULATION FOR RADIOTHERAPY IN A DISTRIBUTED COMPUTING ENVIRONMENT
The Monte Carlo Method: Versatility Unbounded in a Dynamic Computing World Chattanooga, Tennessee, April 17-21, 2005, on CD-ROM, American Nuclear Society, LaGrange Park, IL (2005) MONTE CARLO SIMULATION
More informationDISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction
DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 1 Introduction Modified by: Dr. Ramzi Saifan Definition of a Distributed System (1) A distributed
More informationPROOF-Condor integration for ATLAS
PROOF-Condor integration for ATLAS G. Ganis,, J. Iwaszkiewicz, F. Rademakers CERN / PH-SFT M. Livny, B. Mellado, Neng Xu,, Sau Lan Wu University Of Wisconsin Condor Week, Madison, 29 Apr 2 May 2008 Outline
More informationThe NorduGrid Architecture and Middleware for Scientific Applications
The NorduGrid Architecture and Middleware for Scientific Applications O. Smirnova 1, P. Eerola 1,T.Ekelöf 2, M. Ellert 2, J.R. Hansen 3, A. Konstantinov 4,B.Kónya 1, J.L. Nielsen 3, F. Ould-Saada 5, and
More informationHEP replica management
Primary actor Goal in context Scope Level Stakeholders and interests Precondition Minimal guarantees Success guarantees Trigger Technology and data variations Priority Releases Response time Frequency
More informationOutline. Definition of a Distributed System Goals of a Distributed System Types of Distributed Systems
Distributed Systems Outline Definition of a Distributed System Goals of a Distributed System Types of Distributed Systems What Is A Distributed System? A collection of independent computers that appears
More informationJyotheswar Kuricheti
Jyotheswar Kuricheti 1 Agenda: 1. Performance Tuning Overview 2. Identify Bottlenecks 3. Optimizing at different levels : Target Source Mapping Session System 2 3 Performance Tuning Overview: 4 What is
More informationPoS(EGICF12-EMITC2)004
: bridging the Grid and Cloud worlds Riccardo Murri GC3: Grid Computing Competence Center University of Zurich E-mail: riccardo.murri@gmail.com GC3: Grid Computing Competence Center University of Zurich
More informationMOHA: Many-Task Computing Framework on Hadoop
Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction
More informationAdaptive Cluster Computing using JavaSpaces
Adaptive Cluster Computing using JavaSpaces Jyoti Batheja and Manish Parashar The Applied Software Systems Lab. ECE Department, Rutgers University Outline Background Introduction Related Work Summary of
More informationHigh Performance Computing Course Notes Grid Computing I
High Performance Computing Course Notes 2008-2009 2009 Grid Computing I Resource Demands Even as computer power, data storage, and communication continue to improve exponentially, resource capacities are
More informationGustavo Alonso, ETH Zürich. Web services: Concepts, Architectures and Applications - Chapter 1 2
Chapter 1: Distributed Information Systems Gustavo Alonso Computer Science Department Swiss Federal Institute of Technology (ETHZ) alonso@inf.ethz.ch http://www.iks.inf.ethz.ch/ Contents - Chapter 1 Design
More informationIvane Javakhishvili Tbilisi State University High Energy Physics Institute HEPI TSU
Ivane Javakhishvili Tbilisi State University High Energy Physics Institute HEPI TSU Grid cluster at the Institute of High Energy Physics of TSU Authors: Arnold Shakhbatyan Prof. Zurab Modebadze Co-authors:
More informationOBTAINING AN ACCOUNT:
HPC Usage Policies The IIA High Performance Computing (HPC) System is managed by the Computer Management Committee. The User Policies here were developed by the Committee. The user policies below aim to
More informationFirst evaluation of the Globus GRAM Service. Massimo Sgaravatto INFN Padova
First evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova massimo.sgaravatto@pd.infn.it Draft version release 1.0.5 20 June 2000 1 Introduction...... 3 2 Running jobs... 3 2.1 Usage examples.
More informationLS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance
11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton
More informationThe ATLAS Software Installation System v2 Alessandro De Salvo Mayuko Kataoka, Arturo Sanchez Pineda,Yuri Smirnov CHEP 2015
The ATLAS Software Installation System v2 Alessandro De Salvo Mayuko Kataoka, Arturo Sanchez Pineda,Yuri Smirnov CHEP 2015 Overview Architecture Performance LJSFi Overview LJSFi is an acronym of Light
More informationChapter 1: Distributed Information Systems
Chapter 1: Distributed Information Systems Contents - Chapter 1 Design of an information system Layers and tiers Bottom up design Top down design Architecture of an information system One tier Two tier
More informationUnderstanding StoRM: from introduction to internals
Understanding StoRM: from introduction to internals 13 November 2007 Outline Storage Resource Manager The StoRM service StoRM components and internals Deployment configuration Authorization and ACLs Conclusions.
More informationIntroduction to Grid Infrastructures
Introduction to Grid Infrastructures Stefano Cozzini 1 and Alessandro Costantini 2 1 CNR-INFM DEMOCRITOS National Simulation Center, Trieste, Italy 2 Department of Chemistry, Università di Perugia, Perugia,
More informationVirtualizing a Batch. University Grid Center
Virtualizing a Batch Queuing System at a University Grid Center Volker Büge (1,2), Yves Kemp (1), Günter Quast (1), Oliver Oberst (1), Marcel Kunze (2) (1) University of Karlsruhe (2) Forschungszentrum
More informationO. Pospishnyi. National Technical University of Ukraine Kyiv Polytechnic Institute, 37, Peremohy Ave., Kyiv Ukraine
INFORMATION SCIENCE AND INFORMATION SYSTEMS RECEIVED 15.01.2014 ACCEPTED 02.02.2014 PUBLISHED 04.02.2014 DOI: 10.15550/ASJ.2014.02.009 GRID RESOURCE ONTOLOGY: A KEYSTONE OF SEMANTIC GRID INFORMATION SERVICE
More informationHistory of SURAgrid Deployment
All Hands Meeting: May 20, 2013 History of SURAgrid Deployment Steve Johnson Texas A&M University Copyright 2013, Steve Johnson, All Rights Reserved. Original Deployment Each job would send entire R binary
More informationGergely Sipos MTA SZTAKI
Application development on EGEE with P-GRADE Portal Gergely Sipos MTA SZTAKI sipos@sztaki.hu EGEE Training and Induction EGEE Application Porting Support www.lpds.sztaki.hu/gasuc www.portal.p-grade.hu
More informationTutorial 4: Condor. John Watt, National e-science Centre
Tutorial 4: Condor John Watt, National e-science Centre Tutorials Timetable Week Day/Time Topic Staff 3 Fri 11am Introduction to Globus J.W. 4 Fri 11am Globus Development J.W. 5 Fri 11am Globus Development
More informationFault tolerance based on the Publishsubscribe Paradigm for the BonjourGrid Middleware
University of Paris XIII INSTITUT GALILEE Laboratoire d Informatique de Paris Nord (LIPN) Université of Tunis École Supérieure des Sciences et Tehniques de Tunis Unité de Recherche UTIC Fault tolerance
More informationGFS Overview. Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures
GFS Overview Design goals/priorities Design for big-data workloads Huge files, mostly appends, concurrency, huge bandwidth Design for failures Interface: non-posix New op: record appends (atomicity matters,
More informationArchitectural challenges for building a low latency, scalable multi-tenant data warehouse
Architectural challenges for building a low latency, scalable multi-tenant data warehouse Mataprasad Agrawal Solutions Architect, Services CTO 2017 Persistent Systems Ltd. All rights reserved. Our analytics
More informationVoldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation
Voldemort Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/29 Outline 1 2 3 Smruti R. Sarangi Leader Election 2/29 Data
More informationA Distributed System Case Study: Apache Kafka. High throughput messaging for diverse consumers
A Distributed System Case Study: Apache Kafka High throughput messaging for diverse consumers As always, this is not a tutorial Some of the concepts may no longer be part of the current system or implemented
More information