CERN: LSF and HTCondor Batch Services

Size: px

Start display at page:

Download "CERN: LSF and HTCondor Batch Services"

Alexis Stevenson
5 years ago
Views:

2 CERN: LSF and HTCondor Batch Services Iain Steers, Jérôme Belleman, Ulrich Schwickerath IT-PES-PS INFN Visit: Batch CERN 2

3 Outline The Move Environment Grid Pilot Local Jobs Conclusion INFN Visit: Batch CERN 3

4 Why Move? Several primary reasons: Scalability Dynamism Open-Source and Community Other reasons as well. INFN Visit: Batch CERN 4

5 Scalability Hard limit on the number of nodes LSF can support. We continue to get closer. Even below, LSF based system has become very difficult to manage. INFN Visit: Batch CERN 5

6 Community LSF is proprietary, was owned by Platform Inc. now IBM. Most sites have moved or are moving from their systems to HTCondor. From experience HTCondor seems to have a brilliant community. INFN Visit: Batch CERN 6

7 Some Numbers What we d like to achieve: Goals Concerns with LSF to nodes nodes max Cluster dynamism Adding/Removing nodes requires reconfiguration 10 to 100 Hz dispatch Transient dispatch rate problems 100 Hz query scaling Slow query/submission response times INFN Visit: Batch CERN 7

8 Our Compute Environment CERN is a heavy user of the Openstack project. Most of our compute environment is now virtual. Configuration of nodes is done via Puppet. INFN Visit: Batch CERN 8

9 Configuration and Deployment Using the HTCondor and ARC CE Puppet modules in the HEP-Puppet Github, plus some of our own. Node lists for the HTCondor configs are pre-seeded via puppetdb queries and hiera. Works very nicely at the moment. INFN Visit: Batch CERN 9

10 ARC CE Configuration catches Had to hack the bdii generation in /usr/share/arc/glue-generator.pl. This was to add correct job counts for VOViews. Cron job running that basically does condor q -af X509UserProxyVoName sort uniq -c. This gets piped to a file which the glue-generator reads. Thanks to RAL for the changes. INFN Visit: Batch CERN 10

11 Dynamic Node Deployment: Serf All HTCondor nodes are running Serf, using an encryption key deployed by our Teigi secrets infrastructure. Allows us to do dynamic node discovery and membership actions. When a new HTCondor worker node joins, the HTCondor security config is re-generated by a python script. Some security/sanity checks made by calls to puppetdb. INFN Visit: Batch CERN 11

12 Why start with Grid? Less to do, with regard to Kerberos/AFS etc. Means early-adopters can help us find problems. Can have a PoC running whilst local-work is ongoing. Smaller number of jobs compared to local. INFN Visit: Batch CERN 12

13 Worker Nodes 8-Core flavour VMs with 16GB RAM partitioned to 8 job slots. Standard WLCG nodes with glexec running a HTCondor version of MachineJob Features. INFN Visit: Batch CERN 13

14 Queues Biggest departure from our current approach, visible from the outside world for Grid jobs. Following the HTCondor way with a single queue, seems to simplify most things. INFN Visit: Batch CERN 14

15 Management The HTCondor Python Bindings have more than stepped up to the plate here. One of our favourite features/parts of HTCondor, can query anything and everything. Most of the management of the pool will be done via the Python-bindings. Most of our monitoring uses the classad projections to avoid unnecessary network transport/load. INFN Visit: Batch CERN 15

16 Accounting Written a Python library that uses the classads library. Accountant daemon running that picks up HTCondor jobs as they finish. Backend agnostic, currently sends jobs data to an accounting db, elasticsearch and our in-house monitoring. Data also being pumped to HDFS and our analytics cluster for later analysis. INFN Visit: Batch CERN 16

17 Monitoring INFN Visit: Batch CERN 17

18 Current Status What have we achieved so far? All Grid items are in place. Grid Pilot is open to all experiments with 440 cores. Running CGROUPS and doing /tmp/ bind-mounts for easy clean-up of pool. In the process of enabling multi-core. INFN Visit: Batch CERN 18

19 The Future Local jobs and taking the Grid PoC to production. INFN Visit: Batch CERN 19

20 Distributed Schedulers 80% of our job-load will be local submission. Several thousand users all wanting to query schedulers... INFN Visit: Batch CERN 20

21 Distributed Schedulers Problem: How do we assign users to a scheduler? Going with a DNS delegation based solution. Hash a user to a schedd i.e. iasteers.condor.cern.ch maps to condorsched01.cern.ch. Using a ketama-hashing solution, changing number of schedds doesn t result in re-maps. INFN Visit: Batch CERN 21

22 User Areas and Authorization Kerberos is needed for local jobs, for things such as access to user areas. Although not related directly to HTCondor, this is a good opportunity to review our existing solutions. HTCondor team have agreed that a solution to this is best housed as a native part of HTCondor. INFN Visit: Batch CERN 22

23 Group Membership Easy for Grid jobs with relatively neat classad expressions. Not so obvious for local submission with 100s of group.subgroup combinations. Python classad functions and lru cache may be our solution here. INFN Visit: Batch CERN 23

24 Conclusion Upgrade LSF to version 9 by end of year. Local jobs and production grid service by end of year. Questions? INFN Visit: Batch CERN 24

HTCondor Week 2015: Implementing an HTCondor service at CERN

HTCondor Week 2015: Implementing an HTCondor service at CERN Iain Steers, Jérôme Belleman, Ulrich Schwickerath IT-PES-PS HTCondor Week 2015 HTCondor at CERN 2 Outline The Move Environment Grid Pilot Local