ATLAS Computing: the Run-2 experience

ATLAS Computing: the Run-2 experience Fernando Barreiro Megino on behalf of ATLAS Distributed Computing KEK, 4 April 2017

About me: SW Engineer (2004) and Telecommunications Engineer (2007), Universidad Autónoma de Madrid (Spain). Working on ATLAS Distributed Computing (ADC) since 2008. 2008-2012: Distributed Data Management developer. 2012-2013: ATLAS Cloud Computing co-coordinator. 2013-2014: JP Morgan Technology Division in Geneva. 2015-now: Workload Management developer, and co-coordinator since April 2016.

Outline: ATLAS workflows: the data processing chain. ATLAS Distributed Computing in Run 2: the Worldwide LHC Computing Grid (WLCG), Distributed Data Management, Distributed Workload Management. ATLAS Distributed Computing: operations and support. Conclusions.

ATLAS workflows: the data processing chain


From collisions to papers in ATLAS: ~150 million sensors, collisions at 40 MHz.

The data processing chain. Detector data: Trigger → raw data (RAW) → Reconstruction → Analysis Object Data (AOD) → Derivation → derived AOD (DAOD) → Analysis. Simulated data: Generation → event generator output (EVNT) → Simulation → simulated interaction with the detector (HITS) → Digitization+Trigger → simulated detector output (RDO) → Reconstruction → AOD → Derivation → DAOD → Analysis. Everything up to Derivation is organized production, running online, at Tier-0 and on the Grid; Analysis is chaotic.

The data processing chain: the Trigger. A two-level online system (hardware + software) that reduces the event rate from 40 MHz (~60 TB/s) to ~1 kHz (~1.6 GB/s) based on physics signatures; the event size is ~1.6 MB.

The data processing chain: Reconstruction. Data quality assessment, calibration, masking of noisy channels, and reconstruction of the trajectories of particles through the detector, producing Analysis Object Data (AOD) from RAW (detector data) or RDO (simulated data).

The data processing chain: Derivation. Reduce the size of the outputs to facilitate analysis: skimming removes events, thinning removes objects within events, slimming removes variables within objects. This step is done centrally to reduce the resource needs of individual users; the output is the derived AOD (DAOD).

The data processing chain: Monte Carlo production. Event generation simulates the particle interactions (EVNT); detector simulation calculates how the particles interact with the detector material (HITS); digitization turns the simulated energy into a detector response that looks like the raw data from the detector (RDO).


ATLAS Distributed Computing (ADC) for Run 2

Worldwide LHC Computing Grid (WLCG). An international collaboration to distribute and analyse LHC data. It integrates computing centres worldwide that provide computing and storage resources into a single infrastructure accessible by all LHC physicists. Tier-0 (CERN): data recording and archival, prompt reconstruction, calibration and distribution. Tier-1s: Tier-0 overspill, second tape copy of the detector data, memory- and CPU-intensive tasks. Tier-2s: processing centres, with the differences with respect to Tier-1s increasingly blurry (more later). For all experiments combined: nearly 170 sites, ~350k cores, ~200 PB of disk, network links of 10 Gb/s and up.


ATLAS Grid Information System (AGIS). Holds the list of attached sites; each site consists of multiple storage endpoints and batch queues. AGIS stores the storage endpoint configurations and the batch/PanDA queue configurations.
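
AGIS exposes this configuration through a REST/JSON interface; a minimal sketch of how a client might pull the PanDA queue configuration is shown below. The endpoint URL and the field names ("site", "corecount", "maxmemory") are assumptions for illustration and should be checked against the AGIS documentation.

```python
# Sketch: list PanDA queues and a few of their limits from an AGIS-like
# JSON API. URL and field names are assumed for illustration only.
import requests

AGIS_URL = "http://atlas-agis-api.cern.ch/request/pandaqueue/query/list/?json"

def list_queues(url=AGIS_URL):
    queues = requests.get(url, timeout=60).json()   # dict keyed by queue name
    for name, cfg in queues.items():
        print(name, cfg.get("site"), cfg.get("corecount"), cfg.get("maxmemory"))

if __name__ == "__main__":
    list_queues()
```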

Workload and data management systems. Central management: ProdSys2 (user interface/CLI), JEDI (tasks), PanDA (jobs) and Rucio (datasets and files). Distributed resources: the distributed batch queues, driven by PanDA, and the distributed storage endpoints, driven by Rucio.

ATLAS Distributed Data Management: Rucio. [Chart comparing yearly data volumes (source: Wired magazine, 4/2013, a bit outdated): business emails ~3000 PB/year, Facebook uploads ~180 PB/year, Google search ~100 PB, Library of Congress, Nasdaq, YouTube ~15 PB/year, Kaiser Permanente ~30 PB, climate databases, LHC data ~15 PB/year. A second plot shows ATLAS storage growing from roughly 100 PB of disk and 100 PB of tape in Run 1 (managed with DQ2), through LS1, to roughly 150 PB of disk and 150 PB of tape in Run 2 (managed with Rucio), with ~14x growth expected between 2012 and 2020.] The current ATLAS data set, considering all data products and replicas, is ~250 PB.

Data Management: Rucio architecture

Rucio features and concepts. Rucio accounts can be mapped to users or groups (e.g. Higgs). The namespace is partitioned by scopes (users, groups and other activities). Data ownership for users and groups makes it possible to enable quota systems. Replica management: rules define the number of replicas and conditions on sites. Granular data handling at file level, with no external file catalogs. Support for multiple protocols for file handling (access/copy/deletion): SRM, HTTP/WebDAV, GridFTP. Metadata storage with an extensible key-value implementation: system-defined (size, checksum, creation time), physics (number of events) and production (job/task that created the file) metadata.
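
As an illustration of these concepts, a minimal sketch using the Rucio Python client is given below. The scope, dataset name and RSE expression are made-up examples, and the client method signatures should be checked against the Rucio documentation for the deployed version.

```python
# Sketch: declare a replication rule and attach metadata with the Rucio
# client. Scope, dataset name and RSE expression are illustrative only.
from rucio.client import Client

client = Client()  # uses the local Rucio configuration for authentication

dids = [{"scope": "user.jdoe", "name": "user.jdoe.my_analysis_dataset"}]

# Rule: keep 2 replicas on disk endpoints matching the (made-up) RSE
# expression, with a 6-month lifetime expressed in seconds.
client.add_replication_rule(
    dids=dids,
    copies=2,
    rse_expression="tier=1&type=DATADISK",
    lifetime=180 * 24 * 3600,
)

# Extensible key-value metadata, e.g. the number of events in the dataset.
client.set_metadata(scope="user.jdoe", name="user.jdoe.my_analysis_dataset",
                    key="events", value=123456)
```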

Data policies and lifecycle. Run 2 is very tight on space and relies on fully dynamic data replication and deletion, with a minimalistic pre-placement of only 2 replicas. Data categories: primary (resident) replicas are base replicas guaranteed to be available on disk and are not subject to automatic clean-up; secondary (cache) replicas are extra replicas dynamically created and deleted according to usage metrics. Data rebalancing redistributes primary copies of popular datasets to disk resources with free space.

Data policies and lifecycle. Every dataset has a lifetime set at creation: 6 months for analysis inputs (DAODs), which have a fast turnaround; 2-3 years for Monte Carlo simulations, which are expensive to regenerate; infinite for RAW. The lifetime can be extended if the data is accessed. Expired datasets can disappear at any time.
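
A toy sketch of such a lifetime policy is shown below; the category names and the extension-on-access behaviour follow the slide, while the function and field names are invented for illustration.

```python
# Sketch: toy dataset-lifetime policy as described on the slide.
# Category names follow the slide; everything else is illustrative.
from datetime import datetime, timedelta

LIFETIMES = {
    "DAOD": timedelta(days=180),        # analysis inputs: fast turnaround
    "MC":   timedelta(days=2.5 * 365),  # expensive to regenerate
    "RAW":  None,                       # never expires
}

def expiry(category, created, last_accessed=None):
    """Return the expiry date of a dataset, or None if it never expires."""
    lifetime = LIFETIMES.get(category)
    if lifetime is None:
        return None
    # Accessing the data extends the lifetime from the last access.
    reference = last_accessed or created
    return reference + lifetime

print(expiry("DAOD", created=datetime(2016, 1, 1),
             last_accessed=datetime(2016, 5, 1)))
```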

More tape and network usage. More data and a highly dynamic lifecycle mean relying more on tape and network. Ongoing tests explore the usage of tape: tape pledges are not reached (~70%), derivations are run from tape, and optimization of tape access is needed. Transfer volumes keep increasing: the LHCOPN is fully utilized, including the secondary network, at ~42 PB/month ≈ 16 GB/s.

Data Management: some metrics. Transfers: >40M files/month, up to 40 PB/month. Downloads: 150M files/month, 50 PB/month. Deletions: 100M files/month, 40 PB/month.

ATLAS Distributed Workload Management: PanDA. [Plot of running job slots in 2016: ~300k slots used against ~150k pledged.] Full grid utilization: resources at Tier-1 and Tier-2 sites are exploited beyond pledge (200% for Tier-2s), across various types of resources: grid, cloud and HPCs.

From requests to jobs. ProdSys2/DEFT turns production requests into tasks (DEFT DB); JEDI turns tasks, including analysis tasks submitted directly by users, into jobs (PanDA DB); the PanDA server dispatches the jobs to pilots running on worker nodes on grids, clouds, HPCs and volunteer computing. Rucio provides the dataset metadata.

JEDI/PanDA workflow. 1. Task brokerage: tasks are assigned to the nucleus site that will collect the task output, based on data locality, remaining work, free storage and the capability to run the jobs. 2. Job generation. 3. Job brokerage: jobs are assigned to processing satellite queues, matching the queue description (walltime limits, memory limits, number of cores) and other dynamic metrics (free space, transfer backlog to the nucleus, data availability, numbers of running and queued jobs, network connectivity to the nucleus). 4. Job dispatch: queued jobs are dispatched based on Global Shares targets.
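
A rough sketch of what such brokerage logic could look like is given below; the filtering, weighting scheme and field names are purely illustrative and do not reproduce the actual JEDI brokerage algorithm.

```python
# Sketch: toy job brokerage that filters queues on static limits and ranks
# them with a simple score. Field names and weights are illustrative only.
def broker(job, queues):
    candidates = []
    for q in queues:
        # Static matching against the queue description.
        if job["walltime"] > q["max_walltime"]:
            continue
        if job["memory"] > q["max_memory"]:
            continue
        if job["cores"] != q["corecount"]:
            continue
        # Dynamic metrics: prefer queues with free slots, local input data
        # and good connectivity to the task nucleus.
        score = (
            q["free_slots"]
            + 50 * q["input_data_fraction"]
            + 10 * q["network_weight_to_nucleus"]
            - q["transfer_backlog_to_nucleus"]
        )
        candidates.append((score, q["name"]))
    return max(candidates)[1] if candidates else None
```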

Task and job parameter auto-tuning. Task and job parameters are tuned automatically: ~10 scout jobs are generated at the beginning of each task and collect real job metrics such as memory and walltime, and the parameters of the successive jobs in the task are optimized based on these metrics. A retrial module acts on failed jobs: it extends the memory and walltime requirements for the related types of errors and prevents the retrial of jobs with irrecoverable errors, so as not to waste CPU time retrying jobs that will never succeed. The rules mapping error codes to actions are configurable through the ProdSys user interface.
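
A minimal sketch of such configurable error-code rules is shown below; the error codes, actions and data structures are invented for illustration and are not the real ProdSys/PanDA configuration.

```python
# Sketch: toy retrial rules mapping error codes to actions, in the spirit of
# the retrial module described above. Codes and actions are made up.
RETRY_RULES = {
    "MEMORY_EXCEEDED":   {"action": "retry", "scale_memory": 1.5},
    "WALLTIME_EXCEEDED": {"action": "retry", "scale_walltime": 1.5},
    "BAD_INPUT_FILE":    {"action": "abort"},   # irrecoverable: don't waste CPU
}

def handle_failure(job, error_code):
    rule = RETRY_RULES.get(error_code, {"action": "retry"})
    if rule["action"] == "abort":
        return None  # give up on this job
    retry = dict(job)
    retry["memory"]   = int(job["memory"]   * rule.get("scale_memory", 1.0))
    retry["walltime"] = int(job["walltime"] * rule.get("scale_walltime", 1.0))
    return retry

print(handle_failure({"memory": 2000, "walltime": 3600}, "MEMORY_EXCEEDED"))
```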

Dynamic job definition. The workload is split dynamically for optimal usage of resources, and is managed at task, job, file and event level.
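
Below is a toy sketch of file-level splitting driven by per-event processing time and a queue walltime limit; the numbers and helper names are assumptions, not the actual JEDI splitter.

```python
# Sketch: split a task's input files into jobs so that each job fits within
# the target walltime of the chosen queue. Purely illustrative numbers.
def split_into_jobs(input_files, events_per_file, sec_per_event, max_walltime):
    events_per_job = max(1, int(max_walltime / sec_per_event))
    files_per_job = max(1, events_per_job // events_per_file)
    return [input_files[i:i + files_per_job]
            for i in range(0, len(input_files), files_per_job)]

files = [f"EVNT.{i:06d}.root" for i in range(100)]
jobs = split_into_jobs(files, events_per_file=1000,
                       sec_per_event=10, max_walltime=86400)
print(len(jobs), "jobs, first job has", len(jobs[0]), "files")
```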

WORLD cloud. The original ATLAS Computing Model was designed around static clouds (mostly national or geographical groupings of sites) that set data transfer perimeters: tasks had to be executed inflexibly within a static cloud and their output had to be aggregated at the Tier-1s (O(10) sites). This model had a series of shortcomings: WLCG networks have evolved significantly in the last two decades and limiting transfers within a cloud is no longer needed; Tier-2 storage was not optimally exploited and only contained secondary data; and high-priority tasks were occasionally stuck at small clouds.

WORLD cloud: new site concepts. Task nucleus: any stable site (Tier-1 or Tier-2) can aggregate the output of a task; the capability of acting as a nucleus is assigned manually based on past experience. Task satellites: process the jobs and send the output to the nucleus; satellites are defined dynamically for each task and are not confined inside a cloud. The WORLD cloud was fully activated in March 2016, with nuclei added progressively.

Global Shares. Distribute the currently available compute resources amongst the activities, measured in currently used HS06 computing power. The implementation is hierarchical: sibling shares have the opportunity to inherit unused resources. Currently only used for production shares, but in the future it will also be used for the analysis vs. production split.
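
A toy sketch of hierarchical shares in which unused capacity flows to sibling shares with pending work is shown below; the share tree, names and numbers are invented for illustration and do not reflect the real Global Shares implementation.

```python
# Sketch: toy shares where leftover capacity is redistributed to sibling
# shares that still have demand. Names and numbers are illustrative only.
def allocate(total, shares, demand):
    """shares: {name: fraction}, demand: {name: requested slots}."""
    alloc = {n: min(demand.get(n, 0), int(total * f)) for n, f in shares.items()}
    leftover = total - sum(alloc.values())
    # Single redistribution pass over siblings with unmet demand.
    hungry = [n for n in shares if demand.get(n, 0) > alloc[n]]
    for n in hungry:
        extra = min(demand[n] - alloc[n], leftover // max(len(hungry), 1))
        alloc[n] += extra
        leftover -= extra
    return alloc

print(allocate(300_000,
               {"MC simulation": 0.5, "Reprocessing": 0.3, "Derivations": 0.2},
               {"MC simulation": 250_000, "Reprocessing": 20_000,
                "Derivations": 80_000}))
```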

Opportunistic resources. Centres willing to contribute to ATLAS but not part of the WLCG: the reconfigured ATLAS online cluster, HPC centres (some of which have more computing power than the whole WLCG), shared academic clusters, academic and commercial clouds, and volunteer computing. Even a backfill of leftover cycles (no dedicated allocation) is extremely interesting for us, but we need to adapt our systems to be able to fully exploit these offers.

Opportunistic resources. HPC: the major HPC contributor is Titan, running purely in backfill mode; there are constraints on the tasks it can run, and still a lot of backfill to exploit further. Cloud: a beautiful example of how the online farm is reconfigured to run Grid jobs when idle; also important is the steady contribution from ATLAS@Home. [Plot: running slots from grid vs. cloud vs. HPC.]

Tier-0 processing. The Tier-0 facility is a powerful cluster designed to cope with the prompt data processing needs, with powerful worker nodes (SSDs, 4 GB per core). During periods without data taking it switches from Tier-0 data processing to grid workflows.

Event Service. Example: optimized utilization of NERSC Edison with the Event Service/Yoda. One AthenaMP job drives 24 workers, one per core; the initialization time has been optimized down to ~3 minutes (white in the plot). Yoda feeds events to the workers until the batch slot is exhausted: blue is productive event processing time, and only the last incomplete event is discarded (red) when the slot terminates.
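
A very rough sketch of the idea, namely event-level bookkeeping so that at most the event in flight when the slot ends is lost, is given below; the function names and timings are invented and this is not the actual Yoda/AthenaMP implementation.

```python
# Sketch: event-level bookkeeping in the spirit of the Event Service.
# Events are processed one at a time until the batch slot expires, so at
# most the last, incomplete event is discarded. Illustrative only.
import time

def process_event(event_id):
    time.sleep(0.05)          # stand-in for real event processing

def run_until_slot_end(event_ids, slot_seconds):
    deadline = time.time() + slot_seconds
    completed, discarded = [], None
    for event_id in event_ids:
        if time.time() >= deadline:
            discarded = event_id   # would not finish in time: discarded
            break
        process_event(event_id)
        completed.append(event_id)
    return completed, discarded

done, lost = run_until_slot_end(range(100), slot_seconds=1.0)
print(f"completed {len(done)} events, discarded event: {lost}")
```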

Ongoing effort: Harvester. Workload management components were adapted ad hoc to the different types of resources as we were getting familiar with them. Harvester aims to provide common machinery for all computing resources and a commonality layer bringing coherence to the HPC implementations. [Diagram: Harvester instances run on edge nodes, VOboxes or in the cloud, get/update/kill jobs against the PanDA server, and submit, monitor and kill workers, where a worker can be a pilot on a Grid site (via a pilot scheduler or CE), a VM spun up in a cloud, or an MPI job on an HPC batch system with a subset of the pilot components on the compute nodes.]
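
A highly simplified sketch of such a commonality layer is shown below: one loop that asks PanDA for work and maps it onto whatever "worker" the resource understands. The class and method names are invented; this is not the real Harvester code.

```python
# Sketch: a Harvester-like commonality layer. One loop talks to PanDA and
# delegates worker submission to a resource-specific plugin. All names are
# invented for illustration.
import time

class GridPlugin:
    def submit_worker(self, job):
        print(f"submitting pilot to CE for job {job['id']}")

class HPCPlugin:
    def submit_worker(self, job):
        print(f"submitting MPI job to batch system for job {job['id']}")

class CloudPlugin:
    def submit_worker(self, job):
        print(f"spinning up VM with embedded pilot for job {job['id']}")

def get_jobs_from_panda(n):
    # Stand-in for the real PanDA server API.
    return [{"id": i} for i in range(n)]

def harvest(plugin, cycles=3, jobs_per_cycle=2, sleep=1):
    for _ in range(cycles):
        for job in get_jobs_from_panda(jobs_per_cycle):
            plugin.submit_worker(job)   # worker = pilot, VM or MPI job
        time.sleep(sleep)

harvest(HPCPlugin())
```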

ATLAS Distributed Computing: operations and support

ADC support & ops model. DAST: Distributed Analysis Support Team. ADCoS: ATLAS Distributed Computing Operations Shifts. DPA: Distributed Production and Analysis (~workload management operations). CRC: Computing Run Coordinator. [Diagram: support and escalation chain from the users (analysis users, production groups) through the first-level shifters (DAST, ADCoS, DPA, CRC) to second- and third-level support (cloud squads, site admins, central services, DDM Ops, developers).]

ADC shifts. Distributed Analysis Support Team (DAST): shifts cover the EU and US time zones; first point of contact for analysis questions; escalates questions/issues to the experts. ATLAS Distributed Computing Operations Shifts (ADCoS): 24/7 follow-the-sun, non-presential shifts; follows failing jobs and transfers, service degradations, etc.; reports to sites/clouds or escalates to the CRC/experts. Computing Run Coordinator (CRC): shifts are one week long and presential at CERN (stand-by); coordinates daily ADC operations; main link within the ADC communities and representation in WLCG ops meetings; facilitates communication between the ADC shifters (in particular ADCoS) and the ADC experts; requires a certain level of expertise.

Monitoring: DDM Dashboard

Monitoring: BigPanDA

Analytics. Traces and job data are streamed to Elasticsearch, which facilitates analytics and easy aggregation and filtering. Example: incoherent user behaviour, such as individual users running their own MC production or occupying non-negligible amounts of resources, can easily be identified.
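
A small sketch of such an aggregation with the Elasticsearch Python client is shown below; the index name and field names ("jobs", "user", "wallclock_hours") are assumptions for illustration and do not reflect the actual ATLAS schema.

```python
# Sketch: aggregate wall-clock consumption per user over the last week from
# job records in Elasticsearch. Index and field names are invented.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

query = {
    "size": 0,
    "query": {"range": {"timestamp": {"gte": "now-7d"}}},
    "aggs": {
        "per_user": {
            "terms": {"field": "user", "size": 10},
            "aggs": {"walltime": {"sum": {"field": "wallclock_hours"}}},
        }
    },
}

result = es.search(index="jobs", body=query)
for bucket in result["aggregations"]["per_user"]["buckets"]:
    print(bucket["key"], bucket["walltime"]["value"])
```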

Wrapping up

Conclusions. 2016 was a very successful year for ATLAS: more data was recorded than anticipated. The ATLAS Computing Model adapted to increasing resource constraints, and ATLAS Distributed Computing was challenged but proved successful. Its components are heavily automated, resilient and show no scaling issues: minimalistic data pre-placement relying on dynamic transfers and deletions according to usage patterns, dynamic job generation and optimization of resource usage, and a strong dependence on the optimal exploitation of opportunistic compute resources. The software is moving in a coherent direction, optimizing CPU and memory consumption. The HL-LHC era will be far more intense and we need to start preparing now! See Simone's presentation for details.

Reference material (grouped on the slide under ATLAS, ATLAS Distributed Computing (ADC), ADC Data Management, ADC Workload Management, and ADC Operations and Support):
T. Wenaus: Computing Overview
A. Filipcic: ATLAS Distributed Computing Experience and Performance During the LHC Run-2
C. Serfon: ATLAS Distributed Computing
J. Catmore: From collisions to papers
ATLAS Resource Request for 2014 and 2015
V. Garonne: Experiences with the new ATLAS Distributed Data Management System
T. Maeno: The Future of PanDA in ATLAS Distributed Computing
T. Maeno: Harvester
C. Adams: Computing shifts to monitor ATLAS Distributed Computing infrastructure operations