Batch Services at CERN: Status and Future Evolution

Helge Meinhard, CERN-IT Platform and Engineering Services Group Leader
HTCondor Week, 20 May 2015
20-May-2015 CERN batch status and evolution - Helge Meinhard at CERN.ch 1

CERN
Science for peace
International organisation close to Geneva, straddling the Swiss-French border, founded 1954
Facilities for fundamental research in particle physics
21 member states, 1 B CHF budget
3 581 staff, fellows, students, apprentices
  2 513 staff members, 566 fellows, 481 students, 21 apprentices
11 000 users
  Including 6 700 from member states, 1 800 USA, 900 Russia, 230 Japan

1954: 12 Member States
Members: Austria, Belgium, Bulgaria, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Israel, Italy, Netherlands, Norway, Poland, Portugal, Slovak Republic, Spain, Sweden, Switzerland, United Kingdom
Candidate for membership: Romania
Associate member: Serbia
Observers: European Commission, India, Japan, Russia, Turkey, UNESCO, United States of America
Numerous non-member states with collaboration agreements

Tools: LHC and Detectors
Exploration of a new energy frontier in p-p and Pb-Pb collisions
LHC ring: 27 km circumference
CMS, ATLAS: general purpose, proton-proton, heavy ions; discovery of new physics: Higgs, SuperSymmetry
LHCb: pp, B-physics, CP violation (matter-antimatter symmetry)
ALICE: heavy ions, pp (state of matter of early universe)

Results so far
Many, the most spectacular one being:
04 July 2012: Discovery of a Higgs-like particle
March 2013: The particle is indeed a Higgs boson
08 Oct 2013 / 10 Dec 2013: Nobel Prize to Peter Higgs and François Englert
  CERN, ATLAS and CMS explicitly mentioned

What is the data?
150 million sensors deliver data 40 million times per second
Up to 6 GB/s to be permanently stored after filtering
Almost 30 PB/y in Run 1
Expect ~50 PB/y in Run 2
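
To relate the yearly volumes to the 6 GB/s peak, the short sketch below converts PB per year into an average sustained rate. It assumes decimal units (1 PB = 10^15 bytes) and a 365-day year, neither of which is stated on the slide, so the results are only indicative.

```python
# Back-of-the-envelope check of the quoted data volumes.
# Assumptions (not from the slides): decimal units and a 365-day year.
PB = 1e15
GB = 1e9
SECONDS_PER_YEAR = 365 * 24 * 3600


def average_rate_gb_per_s(petabytes_per_year: float) -> float:
    """Average sustained rate in GB/s that yields the given yearly volume."""
    return petabytes_per_year * PB / SECONDS_PER_YEAR / GB


peak_gb_per_s = 6.0                       # "up to 6 GB/s" from the slide
run1_avg = average_rate_gb_per_s(30.0)    # "almost 30 PB/y in Run 1"
run2_avg = average_rate_gb_per_s(50.0)    # "expect ~50 PB/y in Run 2"

print(f"Run 1 average: {run1_avg:.2f} GB/s")
print(f"Run 2 average: {run2_avg:.2f} GB/s")
print(f"Run 2 average as a fraction of peak: {run2_avg / peak_gb_per_s:.2f}")
```

Under these assumptions the yearly averages sit well below the peak rate, which is consistent with 6 GB/s being an upper bound rather than a continuous rate.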

The Worldwide LHC Computing Grid
An international collaboration to distribute and analyse LHC data
Integrates computer centres worldwide that provide computing and storage resources into a single infrastructure accessible by all LHC physicists
Tier-0 (CERN): data recording, reconstruction and distribution
Tier-1: permanent storage, re-processing, analysis
Tier-2: simulation, end-user analysis
Nearly 170 sites, 40 countries
~350 000 cores
500 PB of storage
> 2 million jobs/day
10-100 Gb links

WLCG Resources [kHS06]

         2014             2015             2016
         Tier-0   All     Tier-0   All     Tier-0   All
ALICE        90   366        175   495        215   609
ATLAS       111   856        205   1 175      257   1 343
CMS         121   738        271   1 071      317   1 417
LHCb         34   218         36   240         51   315
Total       356   2 178      687   2 981      840   3 684

One x86 core: 6-15 HS06
At CERN:
  Some capacity provided in addition for analysis (Tier-3)
  Experiments choose to split pledge across batch, cloud, and service nodes
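
As a quick consistency check on the resources table, the sketch below re-adds the per-experiment figures and computes the growth from 2014 to 2016. The numbers are copied from the table; the helper names are illustrative only.

```python
# kHS06 figures copied from the WLCG Resources table: (Tier-0, All) per year.
RESOURCES = {
    "ALICE": {2014: (90, 366), 2015: (175, 495), 2016: (215, 609)},
    "ATLAS": {2014: (111, 856), 2015: (205, 1175), 2016: (257, 1343)},
    "CMS": {2014: (121, 738), 2015: (271, 1071), 2016: (317, 1417)},
    "LHCb": {2014: (34, 218), 2015: (36, 240), 2016: (51, 315)},
}
# "Total" row as printed on the slide.
TOTALS = {2014: (356, 2178), 2015: (687, 2981), 2016: (840, 3684)}


def column_sum(year: int, index: int) -> int:
    """Sum one column (0 = Tier-0, 1 = All) over the four experiments."""
    return sum(per_year[year][index] for per_year in RESOURCES.values())


def growth_percent(old: float, new: float) -> float:
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


totals_consistent = all(
    column_sum(year, index) == TOTALS[year][index]
    for year in TOTALS
    for index in (0, 1)
)
tier0_growth = growth_percent(TOTALS[2014][0], TOTALS[2016][0])
all_growth = growth_percent(TOTALS[2014][1], TOTALS[2016][1])

print(f"Totals match the per-experiment rows: {totals_consistent}")
print(f"Tier-0 growth 2014 -> 2016: {tier0_growth:.0f}%")
print(f"All-tier growth 2014 -> 2016: {all_growth:.0f}%")
```

The Tier-0 column more than doubles over the two years, while the all-tier total grows more slowly.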

Current Situation: Batch at CERN
Currently (08 May) deployed:
  4 058 worker nodes (of which 3 669 virtual)
  58 488 cores
  530 kHS06
Some 400 000 jobs per day, mostly single-threaded (one core)
Mix of local and Grid submission
  Grid: experiment frameworks submit to CREAM CEs
  Grid amounts to 20-40% of submissions at CERN
Some 25 000 more cores to come before Run 2 physics
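
The deployment figures can be turned into a few derived quantities, sketched below. All inputs are taken from this slide; the mean job rate assumes jobs arrive evenly over 24 hours, which is a simplification and not a claim from the talk.

```python
# Figures quoted for the batch service as deployed on 08 May.
WORKER_NODES = 4058
VIRTUAL_NODES = 3669
CORES = 58488
CAPACITY_HS06 = 530_000      # 530 kHS06
JOBS_PER_DAY = 400_000       # "some 400 000 jobs per day"

hs06_per_core = CAPACITY_HS06 / CORES
virtual_fraction = VIRTUAL_NODES / WORKER_NODES
cores_per_node = CORES / WORKER_NODES
# Assumption: uniform arrival over the day; real load is bursty.
mean_job_rate_hz = JOBS_PER_DAY / (24 * 3600)

print(f"HS06 per core:          {hs06_per_core:.2f}")
print(f"Virtual worker nodes:   {virtual_fraction:.1%}")
print(f"Cores per worker node:  {cores_per_node:.1f}")
print(f"Mean job rate:          {mean_job_rate_hz:.2f} Hz")
```

The resulting HS06 per core falls inside the 6 to 15 HS06 range quoted for one x86 core, and the mean job rate gives a rough lower bound on the dispatch rate the scheduler has to sustain.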

Workload Management
Since the late 1990s, CERN has been using a commercial product: Platform Inc.'s Load Sharing Facility (LSF)
Platform Inc. was acquired by IBM in 2011/2012
CERN's licence is perpetual; maintenance is currently covered until November 2017
We are running release 7.0.6
  Releases 8 and 9 are out; no significant advantages for CERN

Pain Points with LSF (1)

Goal                          LSF constraint
30 000-50 000 worker nodes    Max. ~6 500 worker nodes
Dynamic cluster               Adding/removing worker nodes requires cluster reconfiguration
10-100 Hz dispatch rate       Transient dispatch problems; sometimes difficult to ensure 1 Hz
100 Hz query scaling          Slow query / submission response times; queries affect submissions
Licence-free system           Licensed product

Pain Points with LSF (2)
Worker node scaling:
  Needed as resources grow by more than 100% from 2014 to 2016; unclear what future distribution of batch vs. cloud resources will be
  Limit appears architecture-related (some central processes single-threaded)
  Limit already constrains us to use unnaturally large VMs (whole hypervisor)
  Limit not changed significantly with LSF 8/9
  Can set up multiple instances that can submit to each other

Pain Points with LSF (3)
Cluster dynamism:
  LSF reconfigurations are expensive: at least some 10 minutes of unresponsiveness
  We are running it once per day
  Sometimes reconfiguration fails, leading to loss of queues etc.
  Some operations require two reconfigurations, hence up to 48 hours of delay to become effective

Pain Points with LSF (4)
Query rate:
  LSF is not (cannot be) protected against users hammering the system with expensive queries
  Number of cases in the past where submissions and job dispatch were seriously affected by query activity
  For ATLAS Tier-0 processing for Run 2, separate LSF instance established

Alternatives to LSF 7 (1)
LSF 8 or 9
  Not really addressing any one of our pain points
PBS offspring
  Way too much trouble reported by other LCG sites
SLURM
  Considered because of claimed scalability
  Good for many cores for massively parallel computing, serious scaling limits on worker nodes and job slots
Grid Engine
  Univa Grid Engine is the only serious contender left
  Commercial, similar architecture to LSF

Alternatives to LSF 7: HTCondor
Open-source, academic environment
Already in widespread use in WLCG, e.g. FNAL, BNL, RAL: good experience
  CERN's requirements are different: CERN cluster already largest and growing; CERN needs to also support local job submission with AFS token passing/extension
HTCondor also used in experiment frameworks (and even as a CE), can be used as cloud scheduler
  Potential for future further integration
Tests so far very successful
  Adding/removing worker nodes
  Failing central manager/submission nodes unproblematic
  Query scaling revealed an issue, fixed by developers very soon after
Scaling test (shadows on LSF worker nodes) looked promising
  2 central managers, 20 schedulers/submission nodes, 1 300 worker nodes with 62 500 job slots
  Architecture promises to support further scale-out (unlike LSF, GE, SLURM etc.)

HTCondor Scaling Behaviour
[Plots: job submission time as a function of number of worker nodes and total number of jobs, for LSF and for HTCondor]

HTCondor Deployment Steps (1)
Start with a (small) service offering Grid submission only (done: see following talk by Iain Steers)
  Mostly transparent to users
  Doesn't require AFS token passing and extension
Grow that service (up to taking all Grid submissions)
  Overflowing into LSF part via condor_glidein possible

HTCondor Deployment Steps (2)
Once necessary developments done, open small service for local job submissions
  Still to be seen to what extent we can (and wish to!) make condor submission look like LSF submission, idem for queries
  User support (documentation, handholding, tutorials etc.) will be integral part of deployment (and take significant resources!)
Grow to full size, reducing LSF capacity
  Close interaction with user community
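
To make the question of an LSF-like submission interface concrete, here is a minimal sketch of a translator from a few `bsub`-style options to an HTCondor submit description. It is not CERN's implementation. The function name, the `+LSFQueue` custom attribute, the `job.log` file name and the sample queue name are all hypothetical; only the `bsub` options `-q`, `-o`, `-e` and the submit commands `executable`, `arguments`, `output`, `error`, `log` and `queue` are taken from the respective tools.

```python
from typing import Dict, List, Tuple

# bsub options handled by this sketch, mapped to internal names.
BSUB_FLAGS = {"-q": "queue", "-o": "output", "-e": "error"}


def parse_bsub(args: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Split leading bsub-style options from the command to run."""
    opts: Dict[str, str] = {}
    i = 0
    while i < len(args) and args[i] in BSUB_FLAGS:
        if i + 1 >= len(args):
            raise ValueError(f"option {args[i]} needs a value")
        opts[BSUB_FLAGS[args[i]]] = args[i + 1]
        i += 2
    command = args[i:]
    if not command:
        raise ValueError("no command given")
    return opts, command


def bsub_to_submit_description(args: List[str]) -> str:
    """Render an HTCondor submit description for a bsub-like call."""
    opts, command = parse_bsub(args)
    lines = [f"executable = {command[0]}"]
    if len(command) > 1:
        # Simple space-joined arguments; values containing spaces
        # would need proper quoting, which this sketch does not do.
        lines.append("arguments = " + " ".join(command[1:]))
    if "output" in opts:
        lines.append(f"output = {opts['output']}")
    if "error" in opts:
        lines.append(f"error = {opts['error']}")
    lines.append("log = job.log")  # hypothetical default log file
    if "queue" in opts:
        # Hypothetical custom job attribute carrying the LSF queue name.
        lines.append(f'+LSFQueue = "{opts["queue"]}"')
    lines.append("queue")
    return "\n".join(lines)


description = bsub_to_submit_description(
    ["-q", "short", "-o", "out.txt", "-e", "err.txt", "./run.sh", "--events", "100"]
)
print(description)
```

A wrapper like this only covers the simplest cases; how an LSF queue should map to HTCondor scheduling policy, and how AFS tokens are passed, are exactly the open points named on the slide.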

HTCondor Deployment Timescale
Grid submissions: see Iain's talk
Timescale for local submission developments and service to be defined
  Hoping for pilot by end 2015, but priority is on full-scale, production-quality service for Grid submissions
Target: Terminate LSF service by end of Run 2