Shooting for the sky: Testing the limits of Condor. HTCondor Week, May 2015. Edgar Fajardo, on behalf of OSG Software and Technology

Shooting for the sky: Testing the limits of Condor. 21 May 2015. Edgar Fajardo, on behalf of OSG Software and Technology 1

Acknowledgement Although I am the one presenting, this work is the product of a collaborative effort from: the HTCondor development team; the GlideinWMS development team; HTCondor UW CHTC (A. Moate), UCSD T2 (Terrence M.), and CMS FNAL T1 (Burt H.), who provided the hardware for the testbed; CMS Computing and the WLCG sites, which provided the worker node resources; and the OSG Software team. 2

Vanilla Condor on a slide (diagram). Legend: Schedd, Central Manager, Worker node. Proudly providing HTC for more than 20 years. 3

GlideInWMS on a slide (diagram). Happily working for more than 6 years!!! Legend: Schedd, Central Manager, Worker node. 4

Vanilla HTCondor vs GlideinWMS in the numbers:
                                 Vanilla Condor    GlideinWMS
Number of nodes                  O(1k)             O(10k)
# Different types of machines    O(10)             O(100)
Location of Schedds and WN       Private           WAN
5

The Challenge: CMS
CMS: "Hi OSG Software folks. For Run II, we would like to have a single global pool of 200,000 running jobs. U think Condor can handle it?"
OSG: "Wow. We would expect so, but nobody has tried that big a pool."
CMS: "Nobody had found the Higgs before we and ATLAS did."
OSG: "Touché."
CMS: "Btw, u can use some of my WLCG resources to test. About 20k slots."
OSG: "Just 20k? Didn't u say u want a pool of 200k? Nvm, we will figure it out."
6

The Challenge in the numbers: CMS requirements for LHC Run II
# Startds                           200,000
Autoclusters (types of machines)    600
Number of Schedds                   <5
High Availability                   YES!!!
Challenge Accepted!!!
7

How to get 200k slots? Gathering them from a commercial cloud, for example*. Without using your credit card? Our solution: the UberGlideIn. *At spot pricing and without using the UberGlideIn. 8

Uber GlideIn (diagram): Normal GlideIn vs. UberGlideIn, each on 1 core.
Normal GlideIn: one GlideIn with one Master and one Startd.
UberGlideIn: one GlideIn running pilot_cpu_benchmark.py, spawning Master 1, Master 2, Master 3, ..., Master n and Startd_1, Startd_2, Startd_3, ..., Startd_n.
This works because the jobs are sleep jobs.
9
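For illustration only, here is a minimal condor_config sketch of one way to fake many single-core startds on a single glidein core using HTCondor's local-name mechanism. This is an assumption about the approach, not the actual UberGlideIn implementation (which lives in pilot_cpu_benchmark.py and launches multiple masters); the names and counts are made up, and the trick is only safe because the test jobs are sleep jobs.

  # Glidein-side condor_config sketch (hypothetical, not the real UberGlideIn)
  NUM_CPUS = 1
  # Two extra startd instances under the same master, distinguished by local name
  STARTD1 = $(STARTD)
  STARTD2 = $(STARTD)
  STARTD1_ARGS = -local-name startd1
  STARTD2_ARGS = -local-name startd2
  STARTD.STARTD1.STARTD_NAME = startd1
  STARTD.STARTD2.STARTD_NAME = startd2
  # each instance also needs its own STARTD_LOG, EXECUTE dir, and address file
  DAEMON_LIST = MASTER, STARTD1, STARTD2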

Moreover, we wanted them distributed all over the WAN. Why? In production, the network latency of having Startds all over the WAN is known to cause problems. See Bradley et al. 10

Does it work? 150k Startds on ~20k real core slots. Yes it does!!! 11

Well, most of the time it works:
PIC Admin: "Hi OSG Software folks. About those tests you are running at our site: just FYI, u brought down our firewall while using > 10k ports."
OSG: "Ohh dude, sorry about that. We will fix it."
Mental note: talk to the HTCondor dev team to reduce the long-lived TCP connections from the WN to the outside (Collector, Negotiator, ...).
12

Now we have the resources, let's test. For ~3 months we iterated through the cycle Test → Report/Brainstorm → Fix. The Ganglia integration (condor_gangliad) was key, THANKS!!! GANGLIAD_REQUIREMENTS = MyType =!= "Machine" 13
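A minimal sketch of how that knob might sit in the central manager's configuration (only the GANGLIAD_REQUIREMENTS line comes from the slide; the rest is an assumption):

  # Central-manager condor_config sketch (hypothetical apart from the last line)
  # Run condor_gangliad alongside the collector/negotiator
  DAEMON_LIST = $(DAEMON_LIST) GANGLIAD
  # With ~150k startds, forwarding every Machine ad would flood Ganglia,
  # so publish only the non-Machine (daemon) ads
  GANGLIAD_REQUIREMENTS = MyType =!= "Machine"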

Did it work? YES!!! One full weekend 14

It was not all a bed of roses (or maybe it was). Chart: max parallel running jobs in a single HTCondor pool, by release: 8.3.0: 92,000; 8.3.1: 150,000; 8.3.2: 200,000. 15

HTCondor Improvements. For more details see Todd's talk "What's new in HTCondor 2015?":
- Non-blocking GSI authentication at the Collector
- Shared port at the worker node; in general, reduce the # of long-lived TCP connections (see the sketch after this list)
- Removed file locking at the Schedd
- Reduced incoming TCP connections at the Schedd
- Batched resource requests from the Collector
16
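To illustrate the shared-port item, a minimal worker-node configuration sketch (an assumption, not the exact configuration used in these tests): it routes all incoming daemon traffic through a single condor_shared_port listener instead of one inbound port per daemon.

  # Worker-node condor_config sketch (hypothetical)
  USE_SHARED_PORT = True
  DAEMON_LIST = $(DAEMON_LIST) SHARED_PORT
  # all daemons on this node now accept connections via one TCP port
  SHARED_PORT_ARGS = -p 9618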

Scale improvements throughout history. Chart: max parallel running jobs in a single HTCondor pool in the latest series, per HTCondor series (7.0.1, 7.1.3, 7.3.1, 7.5.5, 8.3.0, 8.3.1, 8.3.2): rising from 2,000, 10,000, and 20,000 in the early 7.x releases to 90,000, 95,000, 150,000, and 200,000 across 7.5.5 and the 8.3.x releases, with CHEP 08, CHEP 09, and CHEP 2015 marked as milestones. 17

Ahh, one more thing.
Brian B: "Hi Edgar. Since u and Jeff are working on the scaling tests, what about we scale test our new rockstar: the HTCondor-CE?"
Edgar: "Sounds good. Which levels are u looking for?"
Brian B: "About the size of the UNL sleeper pool, ~16k parallel running jobs?"
Edgar: "Are u kidding me? That is twice as much as what any OSG site needs!"
18

Step back: What is a CE (Compute Element)? 19

HTCondor-CE in a slide (diagram).
HTCondor case: a grid-universe job on the submit host's HTCondor Schedd is sent via HTCondor-C submit to the HTCondor-CE Schedd, where the Job Router transforms the CE job into a vanilla HTCondor job for the site's HTCondor Schedd.
PBS case: the same path up to the HTCondor-CE Schedd, where the Job Router transforms the grid job into a routed job (grid universe) and a blahp-based transform hands it to PBS as a PBS job.
20
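To make the flow concrete, a minimal submit-file sketch for sending a test job to an HTCondor-CE through the grid universe; the CE hostname is hypothetical and the proxy line assumes GSI authentication as used at the time.

  # test_ce.sub (sketch; ce.example.org is a made-up CE hostname)
  universe           = grid
  grid_resource      = condor ce.example.org ce.example.org:9619
  use_x509userproxy  = true
  executable         = /bin/sleep
  arguments          = 300
  output             = ce_test.out
  error              = ce_test.err
  log                = ce_test.log
  queue

Submitting this with condor_submit lands it on the CE Schedd, where the Job Router rewrites it for the local batch system, as in the two cases above.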

What is HTCondor-CE? Condor + configuration. For more details see the OSG AHM 2015 or CHEP 2015 talks. 21

Did we make it? YES!!! After some development cycles, 16k jobs were achieved; the UNL sleeper batch started to swap. Better than any GRAM5 known to date. 22

HTCondor-CE in the numbers:
                                       HTCondor-CE     GRAM 5
Best max running jobs                  16k*            10k
Network port usage (per running job)   2               4
Startup rate                           70 jobs/min     55 jobs/min*
*Disclaimer: these tests were done on different hardware, with 5 years in between them.
23

Conclusions The OSG Software team, in conjunction with the HTCondor and GlideinWMS development teams, has pushed the scalability limits of a single HTCondor pool. The HTCondor-CE is ready to rock and roll. 24

Questions? Contact us at: 1-900-scale-masters 25

Just kidding. Contact us: osg-software@opensciencegrid.org 26