ATLAS operations in the GridKa T1/T2 Cloud

Journal of Physics: Conference Series 331 (2011) 072047

G Duckeck (1), T Harenberg (2), S Kalinin (2), G Kawamura (3), K Leffhalm (4), J Meyer (5), S Nderitu (6), A Olszewski (7), A Petzold (8), J Schultes (2), C Serfon (1), J E Sundermann (9), R Walker (1), for the ATLAS collaboration

(1) LMU München, (2) Universität Wuppertal, (3) Universität Mainz, (4) DESY-Zeuthen, (5) Universität Göttingen, (6) Universität Bonn, (7) INP PAN Krakow, (8) KIT Karlsruhe, (9) Universität Freiburg

E-mail: gduckeck@lmu.de

Abstract. The ATLAS GridKa cloud consists of the GridKa Tier-1 centre and the 12 Tier-2 sites from five countries associated to it. Over the last years a well-defined and tested operation model has evolved. Several core cloud services need to be operated and closely monitored: distributed data management, involving data replication, deletion and consistency checks; support for ATLAS production activities, which includes Monte Carlo simulation, reprocessing and pilot factory operation; continuous checks of data availability and performance for user analysis; and software installation and database setup. Of crucial importance are good communication between the sites, the operations team and ATLAS, as well as efficient cloud-level monitoring tools. This paper gives an overview of the operations model and the ATLAS services within the cloud.

1. Overview
The ATLAS GridKa cloud is one of the largest and most heterogeneous of the 10 ATLAS Tier-1 clouds. It consists of 13 Tier-1 and Tier-2 sites (Fig. 1) in 5 countries. In addition, several Tier-3 sites with non-pledged resources participate in cloud operations. The combined resources add up to about 50 kHS06 of CPU (about 10k cores) and 5.5 PB of disk storage. A further increase by about 30 % is planned for spring 2011. Over the last years a well-defined and tested operation model has evolved. Several core cloud services need to be operated and closely monitored:
- distributed data management,
- ATLAS production (MC simulation and real data reprocessing),
- data access for analysis jobs,
- software installation and conditions database setup.

Figure 1. Map of the ATLAS-GridKa cloud.

Equally important are regular processing challenges to optimize site performance and data throughput and to exercise new analysis workflows which have appeared since the LHC start, e.g. large-scale Root-based ntuple analyses.

1.1. Organisation
In order to coordinate the regular operations there are weekly operations meetings between the Tier-1 and the service coordinators. These are complemented by monthly cloud video meetings including the Tier-1/Tier-2 contacts and the service coordinators. Wiki pages are maintained for documentation and information.

2. Data Management - DDM
The distribution and management of ATLAS data [1] is essential within the cloud and is one of the most challenging tasks. It includes:
- distribution of data from collisions, MC and reprocessing,
- organising space management (space tokens at sites, etc.),
- data aggregation from MC production,
- replication of user data on request,
- data consistency checks and cleaning,
- data recovery in case of data loss.
Since the start of the LHC both data volume and traffic have increased substantially, see Fig. 2:
- overall, more than 4 PB of data are stored in the cloud,
- about 50 % of this is LHC collision data from 2010,
- peak rates exceeded 1 GB/s,
- replication of user data contributes significantly.
Data loss at sites due to disk server failures occurs regularly; automatic procedures are in place for quick recovery if the data are available elsewhere on the Grid [2]. A minimal sketch of the consistency-check idea is given below.
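The consistency checks listed above essentially compare the catalogue view of a site with what is actually present on its storage. The following minimal sketch illustrates the idea only; the file names and plain-text formats are hypothetical, whereas the actual ATLAS procedures work against dcache/DPM namespace dumps and the central DDM catalogues.

# Minimal sketch of a storage/catalogue consistency check (hypothetical file formats).
# 'storage_dump.txt'  : one physical file path per line, dumped from the storage namespace
# 'catalogue_list.txt': one expected file path per line, as listed in the DDM catalogue

def load_paths(filename):
    """Read one path per line, ignoring empty lines."""
    with open(filename) as f:
        return {line.strip() for line in f if line.strip()}

storage = load_paths("storage_dump.txt")
catalogue = load_paths("catalogue_list.txt")

dark_data = storage - catalogue    # on disk but unknown to the catalogue -> candidates for cleaning
lost_files = catalogue - storage   # catalogued but missing on disk -> candidates for re-replication

print(f"{len(dark_data)} dark files (cleanup candidates)")
print(f"{len(lost_files)} lost files (recovery candidates)")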

Figure 2. Storage volume and usage split in the ATLAS-GridKa cloud.

Figure 3. File access rate per day at the GridKa Tier-1 dcache.

Several tools have been developed to monitor and analyse in more detail the data traffic and access patterns at the sites, complementing the existing ATLAS DDM tools. Based on the detailed information in the dcache billing logs, we can determine the file-access frequency of certain dataset-name patterns over time, distinguishing local (dcap) and remote (gridftp) access (Fig. 3). In addition, the remote data distribution can be analysed: for each transfer the remote peer is extracted, and one can then determine how much traffic goes to which sites, distinguishing incoming and outgoing connections. An example is given in Fig. 4, which shows the peer domains for gridftp transfers to or from the dcache storage at the Wuppertal Tier-2 site. A minimal sketch of such a billing-log analysis is shown below.
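As an illustration of this billing-log analysis, the sketch below counts file accesses per protocol and gridftp transfers per peer domain from a simplified, space-separated log. The log file name, column layout and dataset-name patterns are assumptions made for the example; the real dcache billing format is considerably richer.

import re
from collections import Counter

# Hypothetical, simplified billing-log format (one access per line):
#   <date> <time> <protocol> <client-host> <pnfs-path>

access_by_protocol = Counter()
traffic_by_peer = Counter()
pattern = re.compile(r"data10_7TeV|mc09|user")   # dataset-name patterns of interest (examples)

with open("billing-2010-09.log") as log:
    for line in log:
        fields = line.split()
        if len(fields) < 5:
            continue
        protocol, peer, path = fields[2], fields[3], fields[4]
        if pattern.search(path):
            access_by_protocol[protocol] += 1        # dcap = local access, gridftp = remote access
        if protocol == "gridftp":
            domain = ".".join(peer.split(".")[-2:])  # reduce the peer host name to its domain
            traffic_by_peer[domain] += 1

print("accesses per protocol:", dict(access_by_protocol))
print("gridftp transfers per peer domain:", traffic_by_peer.most_common(10))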

Figure 4. Remote domains for gridftp transfers from/to the Wuppertal Tier-2 dcache in September 2010.

Figure 5. Job statistics, Tier-1 vs. Tier-2, in the GridKa cloud (January - September 2010).

3. Monte Carlo Production
Monte Carlo production is a continuous service within the cloud. Two Panda pilot submission instances are running to provide redundancy and scaling. A large fraction of the production is handled by the Tier-2 sites, and a few Tier-3 sites also contribute. The submission instances require supervision, and good contact with the site managers is needed to aid with problem spotting and solving.
The DE cloud contributes well to the overall ATLAS Monte Carlo production:
- about 12 % in 2010,
- up to 16000 jobs running simultaneously in the cloud.
In recent years the job failure rate has been continuously reduced, and we now reach a walltime efficiency of 94 % (average in the cloud). The contributions from Tier-1 and Tier-2 are well balanced, and the share of analysis jobs has increased substantially since the LHC start, as shown in Fig. 5.

4. Distributed Analysis
Since the LHC start in spring 2010, group and user analysis jobs have become much more prominent. Typical analysis jobs have much higher I/O requirements than MC production; it is a challenge for sites to optimize the overall throughput between the compute cluster and the storage systems and to achieve good job efficiency.

Figure 6. dcache I/O rate at the LRZ Tier-2 site as a function of the number of jobs.

Figure 7. Example of a HammerCloud test result.

4.1. I/O tests and optimization
A dedicated tool (dcacheloadtest) has been developed in the cloud in order to systematically assess the I/O performance between the worker-node cluster and the storage systems. Series of jobs on the worker nodes read files randomly from a large dataset. This way one can determine both the integral I/O rate as a function of the number of jobs running in parallel (Fig. 6) and the I/O rate of single storage pools, which helps to identify bottlenecks and optimize the overall I/O rate (a minimal sketch of the idea is given after Sec. 4.2).

4.2. ATLAS HammerCloud Tests
The ATLAS HammerCloud system [3] is an invaluable tool to systematically assess and optimize the analysis performance of a site. We run regular distributed-analysis tests on all sites in the cloud to evaluate their performance and reliability and to compare options for data access, namely file stage-in to local disk versus direct I/O via dcap/rfio to the dcache/DPM storage systems. Figure 7 shows a recent example of a test; important goals are a low job failure rate and a decent CPU-time/wall-time ratio.
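Relating to the dcacheloadtest approach of Sec. 4.1, the sketch below shows the basic idea behind such a load test: each worker job reads files drawn at random from a large dataset and reports its achieved read rate, so that the aggregate rate can be studied as a function of the number of parallel jobs. The file-list name and the plain file reads are placeholders (the real test reads via dcap/rfio from the dcache pools); this is not the actual cloud tool.

import random
import time

def read_random_files(file_list, n_files, blocksize=1024 * 1024):
    """Read n_files randomly chosen files in full and return the achieved rate in MB/s."""
    total_bytes = 0
    start = time.time()
    for path in random.sample(file_list, n_files):
        with open(path, "rb") as f:           # stand-in for dcap/rfio access to dcache
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                total_bytes += len(block)
    elapsed = time.time() - start
    return total_bytes / 1e6 / elapsed

if __name__ == "__main__":
    # Hypothetical list of the dataset's files, one path per line.
    with open("dataset_files.txt") as f:
        files = [line.strip() for line in f if line.strip()]
    rate = read_random_files(files, n_files=20)
    print(f"single-job read rate: {rate:.1f} MB/s")
    # Running N such jobs in parallel on the worker nodes and summing the reported
    # rates gives the integral I/O rate as a function of N (cf. Fig. 6).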

Figure 8. Ntuple analysis performance test: the plots show job performance parameters (number of files currently processed, I/O rate and event rate) versus time, on the left as a scatter plot for each job and on the right aggregated over all jobs.

4.3. Large-scale Root-based ntuple analysis
Certain analysis use cases require multiple fast iterations over large data samples. Parallel Grid/batch execution (e.g. splitting one 2-day job into 100 jobs of 30 min each) can be a competitive and widely available alternative to dedicated Proof clusters. However, this poses new challenges for the I/O access patterns between worker node and storage. Moreover, such an application needs a quick turnaround; less than one hour between job submission and completion on the Grid would be desirable.
Based on a real analysis case, a realistic benchmark test suite has been derived (G. Brandt/J. Samson/W. Ehrenfeld, DESY) for standard batch-system submission. With few modifications this was ported to ATLAS Grid submission and tested on several sites. An example test is shown in Fig. 8. A 1.1 TB ntuple dataset consisting of some 2000 files was processed via the ATLAS GangaPanda system [4] at the DESY-ZN Tier-2 site. The task was automatically split into 80 subjobs. The typical processing time varied between 30 and 50 min (depending on the CPU type). Since the site was not fully loaded at that time, all jobs started within 10 minutes and the processing of the whole dataset finished within one hour. The splitting idea is illustrated by the sketch below.
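The splitting strategy can be illustrated with a short sketch: given the list of files in a dataset and a rough per-file processing time, the task is partitioned into subjobs of a target length. The numbers, file names and function names here are illustrative only and are not the GangaPanda API, which performs the splitting automatically for the user.

import math

def split_dataset(files, seconds_per_file, target_walltime=1800):
    """Partition a dataset file list into subjobs of roughly target_walltime seconds each."""
    files_per_job = max(1, int(target_walltime // seconds_per_file))
    n_jobs = math.ceil(len(files) / files_per_job)
    return [files[i * files_per_job:(i + 1) * files_per_job] for i in range(n_jobs)]

# Illustrative numbers close to the test of Fig. 8: ~2000 files and a per-file
# processing time that yields subjobs of roughly 30 minutes.
files = [f"ntuple_{i:04d}.root" for i in range(2000)]     # hypothetical file names
subjobs = split_dataset(files, seconds_per_file=75, target_walltime=1800)
print(f"{len(subjobs)} subjobs, {len(subjobs[0])} files each")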

5. Conclusions
The sites and operations in the GridKa cloud have reached a decent level of general performance and stability. The experience from running in 2010 showed that the sites and operations can cope well both with the increased data volume and flow since the LHC start and with the much increased analysis activity.

References
[1] I. Ueda et al., ATLAS Operations: Experience and Evolution in the Data Taking Era, these proceedings.
[2] C. Serfon et al., The consistency service of the ATLAS Distributed Data Management system, these proceedings.
[3] D. Vanderster et al., HammerCloud: A Stress Testing System for Distributed Analysis, these proceedings.
[4] J. Elmsheuser et al., Reinforcing User Data Analysis with Ganga in the LHC Era: Scalability, Monitoring and User Support, these proceedings.