UK Tier-2 site evolution for ATLAS. Alastair Dewhurst


Introduction My understanding is that GridPP funding is only part of the story when it comes to paying for a Tier-2 site. Each site is unique. The aim is to describe the various options and let sites decide what is best for them. Where possible, I have tried to cite my sources; if not, then assume it is my opinion!

Hardware profile [Chart: storage capacity vs. time, showing the VO minimum requirement and the Normal Tier-2 → Transition → CPU-only phases] Can we make use of this hardware? The UK may well be unique in having sites with significant storage capacity that gets decommissioned within the next 3-5 years.

Available FTE [Chart: FTE profile vs. time for the Normal Tier-2 + Transition and CPU-only phases: a small amount of time for development/upgrade work on top of what is required to maintain a reliable service] Solutions need to be deployed before the manpower drops.

CPU Only - UCL Model UCL was migrated to CPU only a while ago. More recently the French Cloud migrated some (non-French) sites to CPU only[1]. The UCL Panda Queues (PQs) point to QMUL DDM endpoints for all reads and writes. Two problems with extending this approach: 1. The extra CPU could overload the other site's storage. 2. If the other site is down, then your resources are wasted. [1] https://indico.cern.ch/event/609911/contributions/2605033/attachments/1479532/2293620/wlcg_2017_diskless_ev_final.pdf
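As a rough illustration of the model (a sketch only, not the real AGIS/PanDA configuration schema; the queue and endpoint names below are placeholders), a CPU-only site essentially declares another site's storage as its own:

```python
# Illustrative sketch only: keys and names are placeholders, not the real
# AGIS/PanDA configuration schema. A CPU-only site simply declares another
# site's DDM endpoint for every read and write.
cpu_only_queue = {
    "panda_queue": "UCL_CPU_ONLY",      # hypothetical PQ name
    "compute_site": "UCL",              # the batch farm lives here
    "read_endpoint": "QMUL_DATADISK",   # all input is read from QMUL...
    "write_endpoint": "QMUL_DATADISK",  # ...and all output written there, so
                                        # heavy CPU use can overload QMUL's
                                        # storage, and a QMUL outage idles UCL.
}
```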

Lessons from Echo Adding a new storage endpoint at RAL taught me a lot about the limitations/assumptions of ATLAS systems. A Panda Queue (PQ) can only talk to one storage endpoint, hence we have two separate sites in AGIS: RAL-LCG2 and RAL-LCG2-ECHO. The same batch farm is defined in exactly the same way for both sites, and both storage endpoints can take the full load of the batch farm, so the Tier-1 batch farm serves both the Castor PQs and the Echo PQs. You don't need much to run a site: when Echo first started working for ATLAS, we had the backend storage, GridFTP access and a .json file which I manually edited for storage accounting.
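For illustration, that hand-edited accounting file was conceptually no more than something like the following (a minimal sketch; the field names and numbers are made up, not the exact layout ATLAS consumes):

```python
import json

# Minimal sketch: keys and values are illustrative, not the exact
# accounting format ATLAS consumes.
space_usage = {
    "endpoint": "RAL-LCG2-ECHO",
    "spacetokens": [
        {"name": "ATLASDATADISK",
         "total_bytes": 4_000_000_000_000_000,   # illustrative numbers
         "used_bytes": 1_500_000_000_000_000},
    ],
}

with open("storage_accounting.json", "w") as f:
    json.dump(space_usage, f, indent=2)
```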

ATLAS Vision ADC Technical Interchange Meeting (TIM), 20th-22nd September[1]. Torre outlined his vision for his term as Computing Coordinator[2]. AES - ATLAS Event Service: jobs can process as little as one event; makes use of opportunistic resources such as HPCs. ESS - Event Streaming Service: an extension to AES that uses a prefetcher to pull in data a little at a time for the job. CDN - Content Delivery Network (let's copy Netflix?!). Data Lake - a bit like a federation of storage endpoints. [1] https://indico.cern.ch/event/654433/ [2] https://indico.cern.ch/event/570497/contributions/2726102/attachments/1524794/2384292/compthemeswenaus.pdf

ATLAS Development There is a general consensus that we will need to have fewer replicas of data in future, so more remote access for jobs will be necessary. The data management team gave a talk on long term planning[1], primarily focused on improving what we have via better monitoring etc. The pilot team gave a talk on future developments[2]: lots of stuff is being re-written, but alternative stage-out (i.e. being able to write output to a different site) is not high on the priority list. [1] https://indico.cern.ch/event/654433/contributions/2716593/attachments/1528356/2390751/2017-09-22_-_data_management_-_long_term_planning.pdf [2] https://indico.cern.ch/event/654433/contributions/2712321/attachments/1528329/2390701/pilot_2_cern_tim_22_september_2017.pdf

Closeness ATLAS defines Closeness as: take the maximum throughput over one hour in a one-month period, convert this throughput to TB/s, and take -log10(throughput in TB/s).
Closeness of 0 = Same site
Closeness of 1 = 100 GB/s
Closeness of 2 = 10 GB/s
...
Closeness of 10 = 0.1 kB/s
Closeness of 11 = No data
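A quick numerical check of the definition (a sketch; the throughput values are just examples):

```python
import math

def closeness(throughput_bytes_per_s: float) -> float:
    """Closeness = -log10(peak one-hour throughput expressed in TB/s)."""
    tb_per_s = throughput_bytes_per_s / 1e12
    return -math.log10(tb_per_s)

# 100 GB/s -> 1, 10 GB/s -> 2, 0.1 kB/s -> 10
for rate in (100e9, 10e9, 100.0):
    print(f"{rate:14.0f} B/s -> closeness {closeness(rate):.0f}")
```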

Federations (DynaFed) FAX was killed off by ATLAS due to insufficient manpower (return on investment). I do not believe an HTTP federation will be any more successful; WebDAV support is currently less well deployed than XRootD. What about DynaFed? It could provide a solution for future experiments that integrate it from the start. RAL (Echo) uses it just as an authentication frontend for S3. I don't think ATLAS need a federation: they have the information to target jobs to the best available site.

XRootD Proxy Cache XCache is the new name for the XRootD Proxy Cache! Old disk servers could be configured as an XRootD Proxy Cache; we have some experience already from using them for CMS AAA, and Echo is copying RALPP's setup. ATLAS are trialling XCache at BNL and some US Tier-2s, for use with the ESS. I am slightly concerned it may become a requirement. Note that the ATLAS and CMS use cases are slightly different: WNs running CMS jobs anywhere connect to an XCache dedicated to a site, whereas WNs running ATLAS jobs at a site can connect to any other site via XCache. [1] http://xrootd.org/doc/dev42/pss_config.htm [2] https://indico.cern.ch/event/330212/contributions/1718785/attachments/642383/883832/PFC-Details-Status.pdf
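To make that contrast concrete (a sketch only; the hostnames are hypothetical placeholders, not real services):

```python
# Illustrative contrast between the two deployment models described above.
# Hostnames are placeholders, not real endpoints.

def cms_cache_for(data_site: str) -> str:
    """CMS (AAA): the XCache is dedicated to a particular site, so WNs
    running anywhere are pointed at that site's cache."""
    return f"xcache.{data_site}.example.org:1094"

def atlas_cache_for(job_site: str) -> str:
    """ATLAS (ESS trials): the XCache sits at the site running the job
    and pulls from whichever remote site holds the data."""
    return f"xcache.{job_site}.example.org:1094"
```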

ARC Cache If the site has a shared file system (between CE and WN), then ARC Cache could be a useful choice to improve job performance. At a recent Ceph meeting[1], SiGNET gave a presentation on their CephFS setup which is used as an ARC cache as well as for local users. The UK does have plenty of experience with ARC CEs, but not with ARC Cache. [1] https://indico.cern.ch/event/669931/contributions/2740166/attachments/1533388/2401265/luminous_on_signet.pdf

Analysis Facility We know that users tend to prefer their own site if using LOCALGROUPDISK. An analysis facility has a large LOCALGROUPDISK (and a SCRATCHDISK, but no DATADISK). This was tried before with Sussex but wasn't used: there was no existing user base. If there is a site with an active analysis user base that wanted this, then there is no reason it can't work.

Flexible batch farm In the long term, if a site is only going to have CPU resources for the Grid then maybe it should just focus on having a really flexible batch farm? Brian Bockelman gave a presentation recently on fancy new features in HTCondor: https://indico.cern.ch/event/654433/contributions/2724530/attachments/1526468/2386983/HTCondorCE-ATLAS-TIM-2017.pdf

Conclusions If we can keep the storage reliable, I suspect ATLAS will allow us to bend the rules as the amount of storage at some sites shrinks. I think we should investigate XCache further; my concern is that this could become another expected service at all sites rather than something useful for just CPU-only sites.