The Echo Project: An Update
Alastair Dewhurst, Alison Packer, George Vasilakakos, Bruno Canning, Ian Johnson, Tom Byrne

What is Echo?
- A Ceph cluster with erasure-coded (EC 8+3) data pools, exposed as S3/Swift and XRootD/GridFTP services (illustrative EC numbers follow below).
- Disk-only storage to replace CASTOR disk for the LHC VOs.
- 65 storage nodes with 36 x 5 TB disks each; initially only 40 nodes are in use.
- 10 PB+ usable space: the largest production Ceph cluster using EC that we are aware of.
- GridFTP and XRootD have now been in production for ~6 months.
- S3 / DynaFed to enter production before the end of the year.
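As an illustration of what EC 8+3 implies, the arithmetic below follows purely from the erasure-coding parameters and is not an Echo-specific measurement:

    # Illustrative arithmetic for an 8+3 erasure-coded pool (not Echo-specific code).
    k, m = 8, 3                     # data chunks, coding chunks per object
    overhead = (k + m) / k          # 1.375 bytes of raw disk per byte of user data
    tolerated_failures = m          # any 3 chunks (OSDs) in a placement group can be lost
    usable_fraction = k / (k + m)   # ~73% of raw capacity is usable
    print(overhead, tolerated_failures, usable_fraction)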

S3 / DynaFed
- The S3 API to Ceph is provided by the widely used RADOS Gateway.
- DynaFed provides secure access to Echo's S3 endpoint.
- The S3/Swift credentials are stored on the DynaFed box; the VO never needs to see them.
- Authentication to DynaFed can be anything Apache understands; we are currently using certificate/proxy.
- DynaFed provides a file-system-like namespace.
- Because DynaFed is developed at CERN, it also supports transfers to existing Grid storage.
- Flow: (1) the user sends a proxy and request to DynaFed, (2) DynaFed returns a pre-signed URL, (3) the client fetches the data directly from S3/Swift (a minimal sketch of this flow follows below).
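A minimal sketch of the pre-signed URL flow, assuming a boto3 client pointed at the RADOS Gateway S3 endpoint. The endpoint, bucket, key and credentials are placeholders, and in Echo the signing happens inside DynaFed rather than on the client:

    import boto3
    import requests

    # Placeholder endpoint and credentials; in production these live on the DynaFed box.
    s3 = boto3.client('s3',
                      endpoint_url='https://s3.echo.example.ac.uk',
                      aws_access_key_id='ACCESS_KEY',
                      aws_secret_access_key='SECRET_KEY')

    # Step 2: generate a time-limited pre-signed URL for an object.
    url = s3.generate_presigned_url('get_object',
                                    Params={'Bucket': 'atlas', 'Key': 'data/file.root'},
                                    ExpiresIn=3600)

    # Step 3: the client fetches the data directly with the signed URL,
    # without ever seeing the S3/Swift credentials.
    data = requests.get(url).content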

ATLAS
- Since April, ATLAS have been using Echo in production.
- At the end of July we reached the intended pledged amount for the year.
- It has been a reliable and performant service.
- All of the major incidents (discussed later) have affected ATLAS.
- Access patterns: WAN GridFTP via FTS; production jobs use GridFTP; analysis jobs use xrdcp with GridFTP writes (and GridFTP failover for reads).

CMS
- Extensive testing of all CMS activities has been performed.
- The CMS workflow uses direct I/O more than ATLAS, so caching and prefetching are needed; this is being done at the XRootD gateway layer.
- We have a clear migration plan: it was decided to commission Echo as a separate PhEDEx site, and formal commissioning work is ongoing.
- Redirection between Echo and CASTOR catches edge cases.
- A separate XRootD proxy cache cluster is used for AAA (built from old CASTOR hardware).

LHCb & Alice LHCb have been able to transfer data into Echo via FTS Demonstrated redirection between Echo and CASTOR works Chris Haen is developing gfal plugin for DIRAC to allow LHCb jobs to access Echo Intended to start work with ALICE in September 2017 Rob A and Catalin C were out in CERN this week, and had discussions with ALICE storage experts ALICE use case similar to CMS but with their own XRootD authentication layer

LIGO
- We have a functioning prototype: the LIGO Data Replicator (LDR) at Cardiff pushes data in via GridFTP, and LIGO can access the data securely via CVMFS (a Stratum-1 and squids serve the analysis jobs, via Paul Hopkins' WebDAV script).
- Work is ongoing to simplify getting LIGO data into Echo, either with a new GridFTP DSI to S3 that would let the LDR upload directly to Echo, or by asking the LDR to support WebDAV and go through DynaFed (proxy + request, pre-signed URL, Echo S3).

XRootD & GridFTP plugin bugs There have been a few bugs Check-summing didn t work (Fixed). Redirection didn t work (Fixed). Caching proxy didn t work (Fixed). N2N component didn t work (Needed for CMS tests, now fixed). File overwrite doesn t work. Memory usage concerns in both plugins

Operational perspective
- Day-to-day running has been fairly smooth; we're getting about one callout a week.
- These are mainly things that Ceph has handled itself but that need manual intervention to resolve fully (e.g. inconsistencies found during scrubbing); a sketch of that kind of follow-up is shown below.
- A lot of this could be automated, but we are taking it slow and safe for now.
- Disk removal/replacement is currently a bit of a time sink.
- Cluster operations (kernel patching, placement-group increases, storage node removal) have been mostly transparent to the VOs.
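A sketch of the kind of manual follow-up involved, assuming admin access and the standard ceph/rados CLIs; the pool name is hypothetical, and in practice the inconsistency is inspected and understood before any repair is issued:

    import json
    import subprocess

    def inconsistent_pgs(pool):
        # 'rados list-inconsistent-pg <pool>' prints a JSON list of PG ids.
        out = subprocess.check_output(['rados', 'list-inconsistent-pg', pool])
        return json.loads(out)

    for pg in inconsistent_pgs('atlas'):                  # hypothetical pool name
        # Look at which objects/shards disagree before touching anything.
        print(subprocess.check_output(['rados', 'list-inconsistent-obj', pg]).decode())
        # Ask Ceph to repair the placement group from the healthy shards.
        subprocess.check_call(['ceph', 'pg', 'repair', pg])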

Issue 1: Gateway memory usage
- In June the gateway machines started swapping and becoming unresponsive.
- I/O usage on Echo was high but not unusual; ATLAS were running normal workflows at RAL.
- Determining the specific cause of the extra memory usage was not trivial, so more detailed gateway and transfer metrics collection was put in place.

Issue 1: Gateway memory usage (continued)
- The XRootD (and GridFTP) Ceph plugin's memory usage during reads is proportional to the size of the file being transferred.
- A large number of jobs requesting large (~10 GB) files started shortly before the gateways began swapping.
- The problem is specific to the Ceph plugin and the buffer size it uses: writes are serial and use at most two buffers, but reads reassemble all stripes in parallel, so buffers are allocated for every stripe of the object (an illustrative model follows below).
- Tests reducing the buffer size (64 MB to 4 MB) cut the memory footprint by a factor of ~15, with a 10% reduction in single-transfer speed.
- Improved monitoring (Kibana and dedicated gateway Grafana dashboards) has been a huge help in diagnosing operational issues since this incident.
- It has become clear that detailed metric collection and log-parsing infrastructure is a vital part of running a large Ceph cluster.
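An illustrative model of the read-path memory usage described above, assuming the 64 MB object/stripe size mentioned elsewhere for Echo; the numbers reproduce the observed ~15x reduction but are a sketch, not a measurement:

    import math

    STRIPE = 64 * 1024**2          # assumed striper object size

    def read_footprint(file_size, buffer_size):
        # Reads reassemble all stripes in parallel: roughly one buffer per stripe.
        return math.ceil(file_size / STRIPE) * buffer_size

    def write_footprint(buffer_size):
        # Writes are serial and hold at most two buffers.
        return 2 * buffer_size

    ten_gb = 10 * 1024**3
    print(read_footprint(ten_gb, 64 * 1024**2) / 1024**3)   # ~10 GB with 64 MB buffers
    print(read_footprint(ten_gb, 4 * 1024**2) / 1024**3)    # ~0.63 GB with 4 MB buffers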

Issue 2: Data loss
Sequence of events:
- The remaining 25 storage nodes from the 2015 generation were added to Echo in early August. Their disks had been tested when they were racked, but had then sat idle for the ~9 months since.
- An OSD was removed due to read errors; it was the primary OSD of a backfilling placement group.
- The first 3 OSDs in the set started crashing, flapping and finally died; removing the problem OSD(s) didn't help.
- Ultimately the PG was manually removed and recreated across all OSDs in the set, causing the loss of ~23K ATLAS files.
- Only 4,000 of the files were unique, lower than usual for an incident of this type at the Tier-1, because we were able to briefly restore the PG and copy off some high-priority files.
Glossary: an object storage daemon (OSD) stores objects on a local file system and provides access to them over the network; a placement group (PG) is a section of a logical object pool that is mapped onto a set of OSDs.

Data loss continued
- This turned out to be caused in part by a bug in the EC backfilling code: http://tracker.ceph.com/issues/18162
- During a backfill, if a read error occurs anywhere in the PG, the primary OSD crashes; the next OSD to assume primary duties then crashes in the same way, until enough OSDs have crashed to render the PG incomplete.
- It is likely that further complications made the data loss unavoidable.
- A separate bug meant we were seeing an incorrect and unhelpful error message for the backfill crash; the first day of working on the problem was spent upgrading Ceph to the latest Kraken release, which fixed that bug.

Data loss conclusion
- We have seen ~4% of the new disks start reporting errors. If you are adding disks that were tested some time ago, it is worth giving them a shakedown first.
- Our disk acceptance testing will need tweaking for Echo.
- The gateways' Ceph plugins do not currently handle inactive PGs gracefully: connections hang and continue to consume resources on the gateways. We are working on reducing the impact of this.
- We are getting better at dealing with internal Ceph issues; subsequent occurrences have been resolved without data loss.

Future plans
- The S3 and DynaFed services will become production Echo offerings before the end of the year.
- We aim to upgrade to Luminous in November; the backfilling bug is fixed in Luminous.
- The 2016 generation of disk servers will be installed with BlueStore OSDs. BlueStore is a new OSD backend in which the OSD consumes the data disk as a raw block device, giving a substantial performance improvement.

Conclusion
- We have been busy getting VOs and services up and running, and are actively looking for new users.
- There have been some operational issues. One, the loss of 23K ATLAS files, was serious, and there will be a post mortem.
- Every incident has been a learning experience; we haven't been bitten by the same thing twice, and all of the issues that have occurred were identified in the risk review.
- The first 6 months of Echo in production have been encouraging, and confidence in Echo is growing.

Thank you

New VO services: Echo architecture
- VO data management systems and WLCG FTS3 reach Echo over three protocols: S3, gsiftp and xroot.
- S3 is served by radosgw (writing to the rgw pool); gsiftp by globus-gridftp-server with its rados plugin and libradosstriper; xroot by xrootd with its rados plugin and libradosstriper.
- All three paths store data in RADOS, in the VO data pools of the Ceph cluster.
- Internal XRootD access from the batch farm goes via proxies on the worker nodes.
(A simplified Python view of the pool/object model follows below.)
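For illustration only: the gateway plugins use libradosstriper from C++, but the plain rados Python binding shows the same pool/object model the diagram describes (the pool name and object key below are hypothetical):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')    # gateway's Ceph config
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('atlas')                  # a VO data pool
        ioctx.write_full('example:hello', b'hello echo')     # the striper splits large files
        print(ioctx.read('example:hello'))
        ioctx.close()
    finally:
        cluster.shutdown()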

Echo hardware
- 65 XMA storage nodes, each with 36 x 5 TB data disks; 40 nodes formed the initial cluster and the remainder were added in August.
- 5 monitor nodes with large SSDs for cluster map storage; plans to put these on the UPS circuit and to move 2 to a separate data centre.
- 5 external gateway nodes, each with 128/192 GB RAM and 4 x 10 Gb Ethernet.

Gateway Monitoring Examples

XRootD/GridFTP AuthZ/AuthN
- The client (xrdcp / globus-url-copy) contacts a gateway, which first checks the grid-mapfile ("is the user valid?") and then the AuthDB ("is the user allowed to perform this operation?"). If both checks pass, data flows to storage via libradosstriper; otherwise an error is returned.
- The grid-mapfile is used for authentication; authorisation is done via XRootD's authdb.
- Ian Johnson added support for this in the GridFTP plugin.
(A simplified sketch of the two checks follows below.)
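A simplified Python sketch of the two checks in the flow above; the DN, local account, paths and rule format are made up for illustration, and the real gateways use the grid-mapfile and XRootD's authdb rather than this code:

    # Hypothetical grid-mapfile contents: certificate DN -> local account.
    GRIDMAP = {'/C=UK/O=eScience/OU=RAL/CN=example user': 'atlas001'}

    # Hypothetical authdb-like rules: (local account, path prefix, allowed operations).
    AUTHDB = [('atlas001', '/atlas/', {'read', 'write'}),
              ('atlas001', '/cms/',   {'read'})]

    def authenticate(dn):
        # Step 1: is the presented identity in the grid-mapfile?
        return GRIDMAP.get(dn)

    def authorize(user, path, op):
        # Step 2: does any rule grant this user the operation on this path?
        return any(user == u and path.startswith(prefix) and op in ops
                   for u, prefix, ops in AUTHDB)

    user = authenticate('/C=UK/O=eScience/OU=RAL/CN=example user')
    print(user is not None and authorize(user, '/atlas/data/file.root', 'write'))  # True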

XRootD architecture
- External VO redirectors point at the external gateways (xrootd + cmsd).
- All XRootD gateways will have a caching proxy: on the external gateways it will be large, since AAA traffic from other sites does not use lazy-download; on the worker-node gateways it will be small, to protect against pathological jobs.
- Worker nodes will be installed with an XRootD gateway, allowing direct connection to the Ceph backend; 80% of the batch farm is now running SL7 + containers.

XRootD caches
- When even one byte of data is requested from an erasure-coded object, the whole object has to be reassembled. This happens on the primary OSD of the PG the object is in.
- ATLAS jobs are configured to copy whole files to scratch; CMS jobs need to access data directly from the storage.
- We tested lazy-download (which downloads 64 MB objects at a time), but it cannot be used with federated XRootD (AAA) access and appears to add a significant overhead to certain types of jobs.
- The solution is to add caches to the gateways. We are currently testing a memory cache (as opposed to a disk cache), and initial testing (by Andrew Lahiff) looks promising (a toy model of the cache is sketched below).
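A toy model of why a gateway-side memory cache helps: every cache miss costs a full object reassembly on the primary OSD, so repeated small reads against the same 64 MB object only pay that cost once. The object size, naming and fetch function are assumptions for illustration, not the real XRootD pss.cache implementation:

    OBJECT_SIZE = 64 * 1024 * 1024     # assumed EC object size in Echo

    class GatewayMemoryCache:
        def __init__(self, fetch_object):
            self.fetch_object = fetch_object   # reassembles one full EC object
            self.blocks = {}                   # (file name, object index) -> bytes

        def read(self, name, offset, length):
            # Reads crossing an object boundary are ignored for simplicity.
            index = offset // OBJECT_SIZE
            key = (name, index)
            if key not in self.blocks:
                # Cache miss: the primary OSD must rebuild the whole 64 MB object.
                self.blocks[key] = self.fetch_object(name, index)
            start = offset - index * OBJECT_SIZE
            return self.blocks[key][start:start + length]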

XRootD caching tests - CMS PhaseIIFall16GS82 (measurements for a single job)

Configuration                                  AvgEventTime  EventThroughput  TotalJobCPU  TotalJobTime
CASTOR                                         142.303       0.024478         6490.24      2043.1
CASTOR (with lazy-download) (1)                137.518       0.0253456        6255.06      1975.19
CASTOR (with lazy-download) (2)                133.403       0.0258622        6073.56      1933.73
Echo (remote gateway) (1)                      531.954       0.0071882        8390.39      6956.31
Echo (remote gateway) (2)                      477.632       0.0079287        7432.37      6306.67
Echo (remote gateway, with lazy-download) (1)  140.139       0.0249962        6044.94      2002.1
Echo (remote gateway, with lazy-download) (2)  134.204       0.0263754        5784.75      1898.26
Echo (local gateway) (1)                       560.677       0.00678703       9031.4       7367.29
Echo (local gateway) (2)                       482.265       0.00788256       6810.49      6343.49
Echo (local gateway + proxy A)                 204.94        0.0175442        7315.44      2850.26
Echo (local gateway + proxy B) (1)             185.221       0.0194803        6463.38      2567.13
Echo (local gateway + proxy B) (2)             185.949       0.0194002        6482.62      2577.73
Echo (local gateway + proxy C) (1)             189.796       0.0188336        6798.95      2655.01
Echo (local gateway + proxy C) (2)             180.042       0.0198329        6238.77      2521.27
Echo (local gateway + proxy D) (1)             171.915       0.0208115        5751.84      2402.68
Echo (local gateway + proxy D) (2)             185.37        0.0193491        6571.65      2584.44
Echo (local gateway + proxy E) (1)             186.836       0.0196245        6541.54      2548.29
Echo (local gateway + proxy E) (2)             184.155       0.0196972        6408.48      2538.67
Echo (local gateway + proxy F) (1)             178.208       0.0200925        5990.55      2488.9
Echo (local gateway + proxy F) (2)             194.985       0.0189344        6938.31      2641.09
Echo (local gateway + proxy G) (1)             176.353       0.0205068        5860.05      2438.7
Echo (local gateway + proxy G) (2)             182.01        0.0197938        6168.35      2526.58
Echo (local gateway + proxy H) (1)             174.227       0.0204519        5696.38      2444.92
Echo (local gateway + proxy H) (2)             175.732       0.0202411        5989.54      2470.37

Conclusions:
- A proxy gives a significant performance boost.
- Proxy parameters such as max2cache and pagesize have a negligible effect.
- Lazy-download improves performance more significantly than a proxy.

Notes:
- Lazy-download is not used unless explicitly specified.
- lcg1652.gridpp.rl.ac.uk was used for testing (CentOS 6 container on SL7).
- The xrootd daemons ran in containers with host networking.
- Proxy parameter sets:
  A: pss.cache debug 3 max2cache 4m pagesize 4m size 1g
  B: pss.cache max2cache 32m pagesize 64m size 16g
  C: pss.cache max2cache 32m pagesize 96m size 16g
  D: pss.cache max2cache 16m pagesize 64m size 16g
  E: pss.cache max2cache 8m pagesize 64m size 16g
  F: pss.cache max2cache 32m pagesize 32m size 16g
  G: pss.cache max2cache 32m pagesize 16m size 16g
  H: pss.cache max2cache 32m pagesize 128m size 16g