What's new in HTCondor? What's coming? HTCondor Week 2018, Madison, WI -- May 22, 2018


What's new in HTCondor? What's coming? HTCondor Week 2018, Madison, WI -- May 22, 2018
Todd Tannenbaum, Center for High Throughput Computing, Department of Computer Sciences, University of Wisconsin-Madison

Release Timeline
- Stable series: HTCondor v8.6.x, introduced Jan 2017; currently at v8.6.11
- Development series (should be the 'new features' series): HTCondor v8.7.x; currently at v8.7.8
- New v8.8 stable series coming this summer
- Detailed version history in the manual: http://htcondor.org/manual/latest/versionhistoryandreleasenotes.html


Enhancements in HTCondor v8.4
- Scalability and stability: goal of 200k slots in one pool, 10 schedds managing 400k jobs
- Introduced the Docker job universe
- IPv6 support
- Tool improvements, especially condor_submit
- Encrypted job execute directory
- Periodic application-layer checkpoint support in the vanilla universe
- Submit requirements
- New RPM / DEB packaging
- systemd / SELinux compatibility

Enhancements in HTCondor v8.6
- Enabled and configured by default: single TCP port, cgroups, mixed IPv6 + IPv4, kernel tuning
- Made some common tasks easier: schedd job transforms
- Docker universe enhancements: usage updates, volume mounts, conditionally drop capabilities
- Singularity support
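
As a quick way to see how these defaults resolve on a given host, the Python bindings expose the local configuration through htcondor.param. A minimal sketch; the knob names queried here (USE_SHARED_PORT, ENABLE_IPV4, ENABLE_IPV6, BASE_CGROUP) are standard configuration macros, but which ones are defined depends on the version and local setup:

```python
import htcondor

# Print the resolved values of a few settings that v8.6 turns on by default.
for knob in ("USE_SHARED_PORT", "ENABLE_IPV4", "ENABLE_IPV6", "BASE_CGROUP"):
    value = htcondor.param[knob] if knob in htcondor.param else "<undefined>"
    print(knob, "=", value)
```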

HTCondor Singularity Integration
What is Singularity? Like Docker, but:
- No root-owned daemon process, just a setuid binary
- No setuid required (post RHEL7)
- Easy access to host resources, including GPUs, network, and file systems
HTCondor allows the admin to define a policy (with access to job and machine attributes) to control:
- Which Singularity image to use
- Volume (bind) mounts
- The location where HTCondor transfers files

What's cooking in the kitchen for HTCondor v8.7 and beyond

Docker Job Enhancements
- Docker jobs get usage updates (e.g. network usage) reported in the job ClassAd
- Admin can add additional volumes that all Docker universe jobs get
- Condor Chirp support
- Conditionally drop capabilities
- Support for condor_ssh_to_job
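
For context, a minimal sketch of building a Docker universe job with the Python bindings described a couple of slides below; the image, commands, and file names are placeholders, and actually submitting the resulting object is shown in the Python API sketch further down:

```python
import htcondor

# Docker universe job; the image name is a placeholder.
docker_job = htcondor.Submit({
    "universe": "docker",
    "docker_image": "debian:stretch",
    "executable": "/bin/cat",
    "arguments": "/etc/os-release",
    "output": "docker_job.out",
    "error": "docker_job.err",
    "log": "docker_job.log",
    "request_memory": "128MB",
})
```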

Not just Batch: Interactive Sessions
Two uses for condor_ssh_to_job:
- An interactive session alongside a batch job: debugging the job, monitoring the job
- An interactive session alone (no batch job): Jupyter notebooks, scheduling shell access (p.s. the Jupyter Hub batchspawner supports HTCondor)
Can tell the schedd to run a specified job immediately!
- Interactive sessions, test jobs
- No waiting for negotiation or scheduling
- condor_i_am_impatient_and_waiting?

HTCondor Python API
- HTCondor v8.7 eliminated the web-service SOAP API; long live Python!
- Started with Python 2, now also support Python 3
- Packaged into the PyPI repository (Linux only): pip install htcondor
- Several API improvements and simplifications, e.g. much easier to submit jobs
- Can use condor_submit JDL directly, including queue for each
- Starting to think about higher-level job submit abstractions
- MS Windows now supported
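
A minimal sketch of the simplified submission path, roughly as it looks in the v8.7-era bindings; the executable, arguments, and file names are placeholders:

```python
import htcondor

# Build a submit description from the same keywords condor_submit uses.
sub = htcondor.Submit({
    "executable": "/bin/sleep",
    "arguments": "60",
    "request_cpus": "1",
    "request_memory": "64MB",
    "output": "sleep.$(Cluster).$(Process).out",
    "error": "sleep.$(Cluster).$(Process).err",
    "log": "sleep.$(Cluster).log",
})

schedd = htcondor.Schedd()             # the local schedd
with schedd.transaction() as txn:      # open a submit transaction
    cluster = sub.queue(txn, count=5)  # like "queue 5" in a submit file
print("submitted cluster", cluster)

# Look at the jobs we just queued.
for ad in schedd.query("ClusterId == %d" % cluster,
                       ["ClusterId", "ProcId", "JobStatus"]):
    print(ad.get("ClusterId"), ad.get("ProcId"), ad.get("JobStatus"))
```

The same Submit class accepts any condor_submit keyword, so the Docker, grid, and GPU sketches elsewhere in this talk differ only in the keys placed in the dictionary.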

HTCondor "Annex"
- Instantiate an HTCondor Annex to dynamically add additional execute slots for jobs submitted at your site
- Get status on an Annex
- Control which jobs (or users, or groups) can use an Annex
Want to launch an Annex on:
- Clouds, via cloud APIs (or Kubernetes?)
- HPC centers / supercomputers, via an edge service (HTCondor-CE)

Grid Universe
- Reliable, durable submission of a job to a remote scheduler
- Popular way to send pilot jobs (used by glideinWMS); key component of HTCondor-CE
- Supports many back-end types: HTCondor, PBS, LSF, Grid Engine, Google Compute Engine, Amazon AWS, OpenStack, CREAM, NorduGrid ARC, BOINC, Globus (GT2, GT5), UNICORE
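
As a sketch of what a grid universe submission looks like through the Python bindings, here is a job routed to a remote HTCondor schedd (the HTCondor-CE style of endpoint); host names are placeholders:

```python
import htcondor

# Grid universe job forwarded to a remote HTCondor schedd / HTCondor-CE.
ce_job = htcondor.Submit({
    "universe": "grid",
    # grid_resource = condor <remote schedd name> <remote collector>
    "grid_resource": "condor ce.example.org ce.example.org:9619",
    "executable": "/bin/hostname",
    "output": "remote.out",
    "error": "remote.err",
    "log": "remote.log",
})
```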

Added Grid Universe support for Azure, SLURM, and Cobalt
- Speak to Microsoft Azure
- Speak the native SLURM protocol
- Speak to the Cobalt scheduler (Argonne Leadership Computing Facility)
(Jaime: Grid Jedi)
Also, an HTCondor-CE "native" package
- HTCondor-CE started as an OSG package
- IN2P3 wanted HTCondor-CE without all the OSG dependencies
- HTCondor-CE is now available stand-alone in the HTCondor repositories
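
And a sketch of the new SLURM back end via the batch grid type; any site-specific attributes are omitted, so treat this as the minimal form only:

```python
import htcondor

# Grid universe job handed to a local SLURM scheduler through the batch GAHP.
slurm_job = htcondor.Submit({
    "universe": "grid",
    "grid_resource": "batch slurm",
    "executable": "/bin/hostname",
    "output": "slurm.out",
    "error": "slurm.err",
    "log": "slurm.log",
})
```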

[Chart: CPU cores over time for the FNAL HEPCloud NOvA run, provisioned via an Annex at NERSC -- http://hepcloud.fnal.gov/]

No internet access to the HPC edge service? File-based job submission:
[Diagram: Schedd A exchanges a JobXXX request, input files, status files (status.1, status.2, status.3), and output files with Schedd B entirely through files.]

condor_annex tool
- Start virtual machines as HTCondor execute nodes in public clouds that join your pool
- Leverage efficient AWS APIs such as Auto Scaling Groups and Spot Fleets; other clouds (Google, Microsoft) coming
- Secure mechanism for cloud instances to join the HTCondor pool at the home institution
- Shut down idle instances; detect instances that fail to start HTCondor
- Implement a fail-safe in the form of a lease to make sure the pool does eventually shut itself off

Compute node management enhancements
Work on noisy-neighbor challenges:
- Already use cgroups to manage CPU and memory; what about CPU L3 cache? Memory bus bandwidth?
- Working with CERN openlab and Intel on leveraging Intel Resource Director Technology (RDT) in HTCondor to monitor utilization and assign shares

Compute node management enhancements, cont.
Multi-core challenges: a low-priority user submits millions of 1-core jobs; subsequently a high-priority user submits a 4-core job. What to do?
- Option 1: Draining. HTCondor can now backfill draining nodes with preemptible jobs.
- Option 2: Preemption. HTCondor can now preempt multiple jobs, combine their resources, and assign them to a higher-priority job.
Initial implementation to gain experience at BNL; however, we are still working on a cleaner model.

Compute node management enhancements, cont.: GPU devices
- HTCondor can detect GPU devices and schedule GPU jobs
- New in v8.7: monitor/report per-job GPU processor utilization and GPU memory utilization
- Future work: simultaneously run multiple jobs on one GPU device (Volta hardware-assisted Multi-Process Service, MPS)
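
For reference, requesting a GPU is just another submit keyword; a minimal sketch with a placeholder executable:

```python
import htcondor

# Job that asks for one GPU device in addition to CPU and memory.
gpu_job = htcondor.Submit({
    "executable": "run_training.sh",   # placeholder script
    "request_GPUs": "1",
    "request_cpus": "1",
    "request_memory": "2GB",
    "output": "gpu_job.out",
    "error": "gpu_job.err",
    "log": "gpu_job.log",
})
```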

Security: from identity certs to authorization tokens
- HTCondor has long supported GSI certs
- Then added Kerberos/AFS tokens for CERN, DESY
- Now adding standardized token support: SciTokens (http://scitokens.org), the OAuth 2.0 workflow (Box, Google Drive, GitHub, ...)

Security, cont.
- Planning for US Federal Information Processing Standard (FIPS) compliance: we can do better than MD5, 3DES, and Blowfish; AES has hardware support in most Intel CPUs
- May motivate us to drop UDP communication in HTCondor; almost all communication in HTCondor is now asynchronous TCP anyway. Anyone care if UDP support disappears?

Scalability Enhancements
- Late materialization of jobs in the schedd to enable submission of very large sets of jobs: more jobs are materialized once the number of idle jobs drops below a threshold (like DAGMan throttling)
- The central manager now manages queries: queries (i.e. condor_status calls) are queued, and priority is given to operational queries
- More performance metrics (e.g. in the collector and DAGMan)
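
A sketch of what a late-materialization submit might look like. max_materialize and max_idle are the submit-language knobs for this feature; whether they are honored through the Python bindings' submit path (as opposed to condor_submit) may depend on the exact 8.7.x version, so treat this as illustrative:

```python
import htcondor

# A very large cluster whose job ads are materialized lazily by the schedd.
factory = htcondor.Submit({
    "executable": "process_one.sh",   # placeholder
    "arguments": "$(Process)",
    "output": "out.$(Process)",
    "error": "err.$(Process)",
    "log": "factory.log",
    "max_materialize": "2000",        # keep at most 2000 job ads materialized
    "max_idle": "500",                # materialize more only when idle < 500
})

schedd = htcondor.Schedd()
with schedd.transaction() as txn:
    factory.queue(txn, count=1000000)  # one million procs, materialized lazily
```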

Data work
- Job input files are normally transferred to the execute node over CEDAR; now they can be sent over HTTP, enabling caching (reverse and forward proxies) and redirects. More info in Carl Vuosalo's talk: https://tinyurl.com/yd6mya96
- Proposed a project to manage data leases (size and time leases) for job output across HTCondor, Rucio, and XRootD
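
A minimal sketch of pulling a job input over HTTP by listing a URL in transfer_input_files (the URL and executable are placeholders); the execute node fetches the file itself, so a caching proxy can sit in the middle:

```python
import htcondor

# Input listed as an http:// URL is fetched on the execute node over HTTP
# rather than being pushed from the submit node over CEDAR.
http_job = htcondor.Submit({
    "executable": "analyze.sh",                                    # placeholder
    "transfer_input_files": "http://data.example.org/dataset.tar.gz",
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
    "output": "analyze.out",
    "error": "analyze.err",
    "log": "analyze.log",
})
```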

Workflows
- Thinking about how to add "provisioning nodes" to a DAGMan workflow: provision an annex, run work, shut down the annex
- Working with the Toil team so the Toil workflow engine can submit jobs into HTCondor


Thank you!