HTCondor Week 2015: Implementing an HTCondor service at CERN

HTCondor Week 2015: Implementing an HTCondor Service at CERN. Iain Steers, Jérôme Belleman, Ulrich Schwickerath, CERN IT-PES-PS.

Outline: The Move, Environment, Grid Pilot, Local Jobs, Conclusion.

Why Move? Several primary reasons: scalability, dynamism, and the open-source project and its community. There are other reasons as well.

Scalability: There is a hard limit on the number of nodes LSF can support, and we keep getting closer to it. Even below that limit, the LSF-based system has become very difficult to manage.

Community: LSF is proprietary; it was owned by Platform Computing, now IBM. Most sites have moved, or are moving, from their existing systems to HTCondor. In our experience, HTCondor has a brilliant community.

Some Numbers. What we'd like to achieve, versus our concerns with LSF:
Goal: 30,000 to 50,000 nodes. Concern with LSF: 6,500 nodes max.
Goal: cluster dynamism. Concern with LSF: adding/removing nodes requires reconfiguration.
Goal: 10 to 100 Hz dispatch. Concern with LSF: transient dispatch-rate problems.
Goal: 100 Hz query scaling. Concern with LSF: slow query/submission response times.

Our Compute Environment: CERN is a heavy user of the OpenStack project, and most of our compute environment is now virtual. Configuration of nodes is done via Puppet.

Configuration and Deployment: We use the HTCondor and ARC CE Puppet modules from the HEP-Puppet GitHub organisation, plus some of our own. Node lists for the HTCondor configuration are generated via PuppetDB queries and Hiera. This works very nicely at the moment.
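As an illustration only, here is a minimal sketch of generating a worker-node list from PuppetDB, assuming a reachable PuppetDB v4 query endpoint and a hypothetical hostgroup fact; the actual modules, fact names and Hiera wiring are different.

    # Sketch: build an HTCondor worker-node list from PuppetDB (assumed v4 API).
    import json
    import requests

    PUPPETDB = "http://puppetdb.example.cern.ch:8080/pdb/query/v4/nodes"  # hypothetical host

    def worker_nodes(hostgroup="htcondor/worker"):
        # AST query: all nodes whose (hypothetical) 'hostgroup' fact matches.
        query = ["=", ["fact", "hostgroup"], hostgroup]
        resp = requests.get(PUPPETDB, params={"query": json.dumps(query)})
        resp.raise_for_status()
        return sorted(node["certname"] for node in resp.json())

    if __name__ == "__main__":
        # Emit a condor_config-style knob that a Puppet template could consume.
        print("WORKER_NODES = " + ", ".join(worker_nodes()))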

Why Just Grid? There is less to do with regard to Kerberos, AFS, etc. It means early adopters can help us find problems, and we can have a PoC running whilst the local-submission work is ongoing. Grid submission also involves a smaller number of jobs than local submission.

Worker Nodes: 8-core flavour VMs with 16 GB RAM, partitioned into 8 job slots. They are standard WLCG nodes with glexec, running an HTCondor version of Machine/Job Features (MJF). MJF is required by the experiments for communicating with their jobs and for making decisions about the machine.
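To show how a payload can consume Machine/Job Features, here is a minimal sketch that reads the key files advertised by the MJF directories; the environment variable names follow the HEPiX MJF convention, and the example keys in the comments are assumptions rather than a complete list.

    # Sketch: read Machine/Job Features key files (one value per file in each directory).
    import os

    def read_features(env_var):
        """Return {key: value} for every file in the directory named by env_var."""
        directory = os.environ.get(env_var)
        if not directory or not os.path.isdir(directory):
            return {}
        features = {}
        for key in os.listdir(directory):
            with open(os.path.join(directory, key)) as handle:
                features[key] = handle.read().strip()
        return features

    machine = read_features("MACHINEFEATURES")  # e.g. number of job slots, shutdown time
    job = read_features("JOBFEATURES")          # e.g. per-job CPU and wall-clock limits
    print(machine, job)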

Queues: This is the biggest departure from our current approach that is visible from the outside world for Grid jobs. We are following the HTCondor way with a single queue, which seems to simplify most things.

Management: The HTCondor Python bindings have more than stepped up to the plate here; they are one of my favourite parts of HTCondor, since you can query anything and everything. Most of the management of the pool will be done via the Python bindings, and most of our monitoring uses ClassAd projections to avoid unnecessary clutter.
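A minimal sketch of the kind of query we mean, using the htcondor Python bindings with ClassAd projections so that only the attributes of interest come back; the attribute choices are illustrative, not our actual monitoring code.

    # Sketch: query the pool via the HTCondor Python bindings, with projections.
    import htcondor

    coll = htcondor.Collector()  # pool collector taken from the HTCondor configuration

    # All schedds known to the collector, fetching only a few attributes.
    schedd_ads = coll.query(htcondor.AdTypes.Schedd,
                            projection=["Name", "TotalRunningJobs", "TotalIdleJobs"])

    for ad in schedd_ads:
        location = coll.locate(htcondor.DaemonTypes.Schedd, ad["Name"])
        schedd = htcondor.Schedd(location)
        # Only the attributes we need, to avoid unnecessary clutter.
        jobs = schedd.query('JobStatus == 2', ["ClusterId", "ProcId", "Owner"])
        print(ad["Name"], len(jobs), "running jobs")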

Accounting: We have written a Python library that uses the ClassAds library, with an accountant daemon that picks up HTCondor jobs as they finish. It is backend-agnostic and currently sends job data to an accounting database, Elasticsearch and our in-house monitoring. The data is also pumped to HDFS and our analytics cluster for later analysis.
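A minimal sketch of the accountant idea, assuming we poll the schedd history for completed jobs and hand the resulting ClassAds to pluggable backends; the real library, daemon and backends are more involved.

    # Sketch: pick up finished jobs from the schedd history and ship them to backends.
    import htcondor

    PROJECTION = ["ClusterId", "ProcId", "Owner", "JobStatus",
                  "RemoteWallClockTime", "RemoteUserCpu", "CompletionDate"]

    def collect_finished(since_completion=0, limit=1000):
        schedd = htcondor.Schedd()
        constraint = 'JobStatus == 4 && CompletionDate > {}'.format(since_completion)
        # history() iterates over job ClassAds that have left the queue.
        return list(schedd.history(constraint, PROJECTION, limit))

    def ship(job_ads, backends):
        for ad in job_ads:
            record = dict(ad)           # plain dict, so backends stay agnostic
            for backend in backends:    # e.g. accounting DB, Elasticsearch, HDFS feeders
                backend(record)

    if __name__ == "__main__":
        ship(collect_finished(), backends=[print])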

Monitoring.

Current Status: What have we achieved so far? All Grid items are at least PoC-ready. The Grid pilot is open to ATLAS and CMS with 96 cores; we have already had 40,000 ATLAS jobs, with CMS starting shortly.

The Future: Local jobs, and taking the Grid PoC to production.

Distributed Schedulers: 80% of our job load will be local submission, which means several thousand users all wanting to query the schedulers...

Distributed Schedulers: Problem: how do we assign users to a scheduler? Suggestions from Miron and Todd: a schedd that answers queries about all schedds, or hashing job IDs to schedds and embedding the result in the ClassAd. One of ours: DNS-delegated personalised addressing for each user, e.g. isteers.condor.cern.ch. We like these ideas, but we would like to know how others have achieved this.
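To make the hashing idea concrete, here is a minimal sketch that maps a username deterministically onto one of a set of schedds; the hostnames are illustrative, and the DNS-per-user scheme would simply put the same mapping behind names such as isteers.condor.cern.ch.

    # Sketch: deterministically map each user onto one of N schedds.
    import hashlib

    SCHEDDS = ["schedd01.cern.ch", "schedd02.cern.ch", "schedd03.cern.ch"]  # illustrative

    def schedd_for(user):
        digest = hashlib.sha1(user.encode("utf-8")).hexdigest()
        return SCHEDDS[int(digest, 16) % len(SCHEDDS)]

    print(schedd_for("isteers"))  # always the same schedd for a given user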

User Areas and Authorization: Kerberos is needed for local jobs, for things such as access to user areas. Although not directly related to HTCondor, this is a good opportunity to review our existing solutions. Problem: securing tickets and renewing them appropriately. Miron suggested letting the schedds handle the renewal of tickets and pass them on to the worker nodes.
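A minimal sketch of the renewal idea, assuming renewable tickets held in per-user credential caches and the standard MIT Kerberos kinit -R behaviour; the cache location is hypothetical, and how the schedd would actually forward credentials to worker nodes is exactly the open question.

    # Sketch: periodically renew renewable TGTs held in per-user credential caches.
    import glob
    import subprocess
    import time

    CCACHE_GLOB = "/var/lib/condor/creds/krb5cc_*"  # hypothetical cache location

    def renew_all():
        for ccache in glob.glob(CCACHE_GLOB):
            # 'kinit -R' renews the TGT in the given cache if it is still renewable.
            result = subprocess.run(["kinit", "-R", "-c", ccache])
            if result.returncode != 0:
                print("renewal failed for", ccache)

    while True:
        renew_all()
        time.sleep(3600)  # well inside the ticket renewable lifetime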

Group Membership: Easy for Grid jobs, with relatively neat ClassAd expressions. It is not so obvious for local submission, with hundreds of group.subgroup combinations. Python ClassAd functions may be our solution here, but we have scaling concerns.
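A minimal sketch of what we mean by Python ClassAd functions, using classad.register() to make a group-membership lookup callable from a ClassAd expression; the group data and names are made up, and the scaling concern is precisely about evaluating such calls at matchmaking rates.

    # Sketch: expose a Python group-membership check to ClassAd expressions.
    import classad

    # Hypothetical mapping of accounting groups to their members.
    GROUPS = {
        "group_u_ATLAS.analysis": {"isteers", "jbelleman"},
        "group_u_CMS.production": {"uschwick"},
    }

    def is_member(owner, group):
        return owner in GROUPS.get(group, set())

    # Register the Python function so ClassAd evaluation can call it by name.
    classad.register(is_member)

    expr = classad.ExprTree('is_member("isteers", "group_u_ATLAS.analysis")')
    print(expr.eval())  # True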

Open Issues: Issues with no clear solution (or with multiple candidate solutions): mapping users to schedulers, Kerberos ticket handling and renewal, and enforcing group membership.

Conclusion: Local jobs and a production service by the end of the year. Questions?