ATLAS & Google "Data Ocean" R&D Project


ATLAS & Google "Data Ocean" R&D Project Authors: Mario Lassnig (CERN), Karan Bhatia (Google), Andy Murphy (Google), Alexei Klimentov (BNL), Kaushik De (UTA), Martin Barisits (CERN), Fernando Barreiro (UTA), Thomas Beermann (CERN), Ruslan Mashinistov (UTA), Torre Wenaus (BNL), Sergey Panitkin (BNL) Project overview 2 Use cases 2 User analysis 2 Data placement, replication, and popularity 2 Data streaming 2 Work packages 3 WP1 - Data management 3 WP2 - Workflow management 3 WP3 - Google Cloud Storage Global Redirection 3 WP4 - Cost Model 3 Addendum 3 List of key personnel and PI s 3 Timeline 4 Resources from ATLAS and Google 4 ATL-SOFT-PUB-2017-002 29 December 2017 Objectives and key results 5 Namespace handling 5 Connecting ATLAS grid storage with Google storage for third-party-copy 5 Monitoring third-party-copy 6 Reading data from Google storage to Grid worker nodes File copy-to-scratch 7 Monitoring copy-to-scratch transfers 7 Reading data from Google storage to Grid worker nodes Streaming random-io 7 Monitoring random-io 7 Deletion of data on Google Storage 7 Reading data inside Google data centres Jobs running on Google compute 7 Network provisioning 8 Transparent global redirection between inter-regional zones on Google Cloud Storage 8 Development of an economic cost model 8 Appendix 1 Brainstorming document 8 Appendix 2 Group photo 10 Bibliography 10

Project overview

ATLAS [1] is facing several challenges with respect to its computing requirements for LHC [2] Run-3 (2020-2023) and the HL-LHC runs (2025-2034). The challenges are not specific to ATLAS or the LHC, but are common to the HENP computing community. Most importantly, storage continues to be the driving cost factor and at the current growth rate cannot absorb the increased physics output of the experiment. Novel computing models with a more dynamic use of storage and computing resources need to be considered. This project aims to start an R&D effort for evaluating and adopting novel IT technologies for HENP computing.

ATLAS and Google plan to launch an R&D project to integrate Google cloud resources (Storage and Compute) into the ATLAS distributed computing environment. After a series of teleconferences, a face-to-face brainstorming meeting in Denver, CO at the Supercomputing 2017 conference resulted in this proposal for a first prototype of the "Data Ocean" project. The idea is threefold: (a) to allow ATLAS to explore the use of different computing models to prepare for the High-Luminosity LHC, (b) to allow ATLAS user analysis to benefit from the Google infrastructure, and (c) to give Google real science use cases to improve their cloud platform.

Use cases

User analysis

When analysts use the distributed analysis services to run on the grid, the outputs are deposited on the grid. Making 100% of those outputs available to the analyst quickly is a difficult problem and remains one of the weak points of distributed analysis. Through this R&D, analysis outputs generated on worker nodes around the world could be directed to Google Cloud Storage, where they become uniformly and reliably available to the analyst anywhere in the world. Analysis data products are small, and the GCS-resident outputs could be regarded as a cache with a limited lifetime, and thus a limited storage footprint, while the value of reliable accessibility of this hot data to analysts would be enormous.

Data placement, replication, and popularity

The final stages of data analysis by users require access to multiple petabytes of data storage. To ensure a high level of access, ATLAS replicates multiple copies of this data to worldwide computing resources. The Google Cloud Storage service could be an alternative location for these highly used data formats. We plan to store the final derivations of the full ATLAS MC and/or reprocessing data campaigns. This data will then be available to users worldwide through Google Compute and ATLAS Compute resources.

Data streaming

ATLAS Computing is investigating the use of sub-file data products in the analysis chain. A prototype of this "Event Streaming Service" is currently in development and could benefit from fine-grained cloud storage. This use case will evaluate the compute needed to generate the sub-file data products ("events") from their original files at the scale required by the HL-LHC, and the performance gains of highly parallel, small-size data delivery to the analysis software.

products ("events") from their original files at the scale required by HL-LHC, and the performance gains of highly parallel small size data delivery to the analysis software. Work packages The proof-of-concept phase of the "Data Ocean" project will consist of four major parts (Work Packages - WP). We envision that these packages will have well defined common milestones and overlaps. Both ATLAS and Google will commit software engineering effort to this project, initially at the level of 3 FTE s total. The expected official project start is early 2018. Additional partners from US National Laboratories and Universities, and CERN/WLCG are likely to join this project. WP1 - Data management This work package connects Google Cloud Storage with the ATLAS Data Management system "Rucio" [3], which will allow writing a full multi-petabyte physics sample to Google Cloud Storage. By taking advantage of Google's and ESnet fast networks, the sample is then distributed by Google between their continental regions/zones and made available to ATLAS Compute across the globe. ATLAS and Google will work together to understand data popularity and cache the most popular physics data vs geographical access pattern. WP2 - Workflow management ATLAS user analysis jobs, brokered by the "PanDA" workflow management system [4], should be able to run using either file-copy or direct-io with Google Cloud Storage. A strategy of using container formats for user analysis jobs will be developed. In addition, this work package will involve running jobs on Google Compute Platform, accessing either data from ATLAS storage or Google Cloud Storage. WP3 - Google Cloud Storage Global Redirection The third work package will involve an improvement to Google Cloud Storage itself. Right now, the ATLAS jobs needs to retain knowledge which Google Cloud region is to be used. Google will implement a global redirection between their regions to expose Google Cloud Storage as a single global entity. WP4 - Cost Model The fourth work package will deal with the economic model necessary for sustainable commercial clouds resource usage. For example, using adaptive pricing for cloud resource costs (storage, compute, network). Addendum List of key personnel and PI s Google

Google: Karan Bhatia, Andy Murphy
ATLAS:
  BNL: Alexei Klimentov, Torre Wenaus, Sergey Panitkin
  CERN: Mario Lassnig, Martin Barisits, Thomas Beermann, Tobias Wegner
  UTA: Kaushik De, Fernando Barreiro, Ruslan Mashinistov

Project Management

The project will be managed jointly by the Google and ATLAS PIs. Progress will be reported and followed up on a weekly basis. Two Technical Interchange Meetings will be organized during the project: once by Google, once by ATLAS.

Timeline

The expected official project start is early 2018.
- X+1 month: detailed description of objectives and key results
- X+2 months: test ATLAS/Google data transfer
- X+3 months: test ATLAS/Google analysis job access
- X+4 months: full ATLAS derived data replica stored by Google
- X+6 months: end-user analysis test
- X+8 months: commissioning and pre-production for selected ATLAS users

Resources from ATLAS and Google

Both ATLAS and Google will commit software engineering effort to this project, initially at the level of 3 FTEs total. It would be highly desirable to have a Google software engineer at CERN to work together with the Rucio and PanDA teams during the PoC implementation and commissioning phase. Additional partners from US National Laboratories and Universities, and CERN/WLCG, are likely to join this project. An estimate of the Google computing resources (storage, bandwidth, and CPUs) for the PoC phase will be made within one month after the project is launched and the WPs are approved by both parties.

Objectives and key results

These OKRs are only loosely coupled and should be doable in parallel after the two initial steps ("Namespace handling" and "Connecting grid storage") are finished. Names and ETAs are tentative and subject to the official project start.

Namespace handling

Google Storage would become a new endpoint for ATLAS, able to address all derived MC and processing campaign data.
- Set up Google storage authorisation and authentication
- Add the Google storage hosts to the ATLAS topology system (AGIS)
- Synchronise the topology with the ATLAS data management system (Rucio)
There should be two available buckets, one in the US (available: 100G Chicago, 10G San Jose, 10G Ashburn; coming: 100G NY, 100G Seattle) and one in the EU (no ESnet peering with Google).

Rucio Data Identifiers (DIDs)
- Are a globally unique tuple <Scope:Name>, e.g., mc16_13tev:12345.hits.pool.root
- Have an associated collection of metadata, e.g., project, datatype, #events
- Can be either a file, a dataset (collection of files), or a container (collection of datasets)
- Are unique among all three categories and cannot be reused
- We put replication rules on DIDs (declarative data management, e.g., 3 copies of this DID, one must be on tape and all should be on different continents)
- DIDs are resolved to files and then to actual replicas (root://hostname/storage/file.123)

RSE (Rucio Storage Element)
- Unique logical unit of data storage
- Has different attributes, e.g., is_tape, geoip, ...
- There is a topological split between the endpoint name (e.g., CERN-PROD_SCRATCHDISK) and the associated hosts behind the name (which could be many, each with a different protocol)
- So, e.g., we could have GCS_EUROPE, GCS_USEAST, GCS_USWEST, ... (and once the GCS global redirector exists, just a single GCS); each one would then have an associated storage endpoint, gcs://bucket/... or (more likely) s3://bucket/... (a sketch of how such an RSE could be registered follows below)
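As a minimal sketch, the following shows how a Google-backed RSE and a replication rule on a DID might be set up through the Rucio client API. The RSE name GCS_USEAST, the bucket path, and the dataset name are illustrative assumptions, and exact method signatures and protocol parameters may differ between Rucio releases.

```python
# Sketch only: registering a hypothetical Google Cloud Storage RSE in Rucio
# and placing a replication rule on a DID. Names (GCS_USEAST, bucket, dataset)
# are illustrative assumptions; exact client signatures may vary by Rucio release.
from rucio.client import Client

client = Client()

# Declare the logical storage element and an attribute.
client.add_rse('GCS_USEAST')
client.add_rse_attribute('GCS_USEAST', 'is_tape', False)

# Attach an S3-style protocol pointing at the (hypothetical) bucket.
client.add_protocol('GCS_USEAST', {
    'scheme': 's3',
    'hostname': 'storage.googleapis.com',
    'port': 443,
    'prefix': '/atlas-dataocean-bucket/',
    'impl': 'rucio.rse.protocols.s3boto.Default',
    'domains': {'wan': {'read': 1, 'write': 1, 'delete': 1, 'third_party_copy': 1}},
})

# Declarative data management: ask for one replica of a dataset DID on the new RSE.
client.add_replication_rule(
    dids=[{'scope': 'mc16_13tev', 'name': 'some.derivation.dataset'}],
    copies=1,
    rse_expression='GCS_USEAST',
)

# Resolve the DID to concrete replicas (e.g., s3://... or root://... URLs).
for replica in client.list_replicas([{'scope': 'mc16_13tev', 'name': 'some.derivation.dataset'}]):
    print(replica['rses'])
```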

Connecting ATLAS grid storage with Google storage for third-party-copy

We can transfer data to/from Google storage using our orchestrated mechanisms in Rucio.
- Verify the Rucio transfertool implementation for S3 compatibility with the Google RSE
- Implement changes to the Rucio transfertool if necessary
- Set rules for DIDs on grid storage to create replicas on the Google RSE
- Set rules for DIDs on the Google RSE to create replicas on grid storage

The proposed input volume is between 1 and 6 petabytes, for two possible scenarios:
- 1 PB of NTUP for end-user analysis only
- 4-5 PB for a complete copy of derivation data for one campaign

The current full analysis produces roughly ~500'000 files of ~40 MB each per day, equalling a growth rate of ~20 TB/day. For the proof-of-concept it should be sufficient if a small percentage of the jobs (<1%) can be rerouted to write their output to GCS (5000 files, <200 GB per day growth rate).

Google has network peering with ESnet, which has connections to several ATLAS Tier-1 and Tier-2 centres in the US and Europe. The connections to US ATLAS sites are very good, whereas the EU peerings are less reliable. BNL might serve as a bridge for EU transfers if necessary. Network monitoring should be considered, especially for the ESnet peering, e.g., using perfSONAR.

Rucio has a multi-queue transfer system (conveyor + transfertool):
- The conveyor decides which transfer requests to take off the queue and process
- The transfertool submits transfer requests to the third-party-copy component; FTS supports WebDAV-to-S3 push third-party copy from DPM and dCache
- It receives acknowledgements and polls the status of transfers
- It updates DIDs, replicas, and rules, does the retries, etc.

Monitoring third-party-copy

We are able to understand the performance differences between our existing transfer infrastructure and Google Storage.
- Ensure instrumentation events are properly forwarded to the monitoring system
- Create dedicated dashboards
We ship all our transfer events into HDFS and ElasticSearch:
- Dashboards, computed durations, historical views, accounting, etc.
- Also the source for our analytics system, e.g., to estimate transfer-time-to-complete using machine learning
The most important metrics (see the sketch below):
- #files/second transferred and deleted
- Mbps per file and per link
- space usage over time
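As a minimal sketch of how those per-link metrics could be derived from the shipped transfer events, assuming each event is a dictionary with fields such as bytes, started_at, transferred_at, src_rse, and dst_rse (illustrative field names, not the exact Rucio/FTS event schema):

```python
# Sketch only: aggregating files/second and mean per-file throughput per link
# from transfer events. Field names (bytes, started_at, transferred_at,
# src_rse, dst_rse) are illustrative assumptions, not the exact event schema.
from collections import defaultdict
from datetime import datetime

def aggregate(events, window_seconds):
    """Per-link metrics over a monitoring window: files/second and mean Mbps per file."""
    per_link = defaultdict(lambda: {'files': 0, 'mbps_sum': 0.0})
    for ev in events:
        started = datetime.fromisoformat(ev['started_at'])
        finished = datetime.fromisoformat(ev['transferred_at'])
        duration = max((finished - started).total_seconds(), 1e-6)
        link = (ev['src_rse'], ev['dst_rse'])
        per_link[link]['files'] += 1
        per_link[link]['mbps_sum'] += ev['bytes'] * 8 / 1e6 / duration
    return {
        link: {'files_per_s': agg['files'] / window_seconds,
               'mean_mbps_per_file': agg['mbps_sum'] / agg['files']}
        for link, agg in per_link.items()
    }

# Example with two hypothetical ~40 MB transfers into Google RSEs.
events = [
    {'src_rse': 'BNL-OSG2_DATADISK', 'dst_rse': 'GCS_USEAST', 'bytes': 40_000_000,
     'started_at': '2018-03-01T10:00:00', 'transferred_at': '2018-03-01T10:00:20'},
    {'src_rse': 'CERN-PROD_DATADISK', 'dst_rse': 'GCS_EUROPE', 'bytes': 40_000_000,
     'started_at': '2018-03-01T10:00:00', 'transferred_at': '2018-03-01T10:01:00'},
]
print(aggregate(events, window_seconds=3600))
```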

Reading data from Google storage to Grid worker nodes - File copy-to-scratch

Jobs can download full input files for processing using rucio-clients.
- Access protocols might differ from those used for third-party-copy
- If new protocols are needed, they can be implemented

Monitoring copy-to-scratch transfers

We can follow the job transfers with our existing monitoring. Every job sends a trace for every file it accesses. A trace is a dictionary containing information such as the location of the file and timestamps (start of the copy, end of the copy). These traces are used (see the sketch below):
- To build the popularity of our data
- To monitor the volume processed by the jobs
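A minimal sketch of what such traces might look like and how popularity and processed volume could be derived from them; the field names below (dataset, rse, filesize, transferStart, transferEnd) are illustrative assumptions rather than the exact ATLAS trace schema:

```python
# Sketch only: deriving data popularity (accesses per dataset) and processed
# volume per storage endpoint from job traces. Field names are illustrative
# assumptions, not the exact schema of ATLAS traces.
from collections import Counter

# One hypothetical trace per accessed file, as sent by a job.
traces = [
    {'scope': 'user.jdoe', 'filename': 'output.0001.root', 'dataset': 'user.jdoe.analysis_v1',
     'rse': 'GCS_USEAST', 'filesize': 42_000_000,
     'transferStart': 1519900000.0, 'transferEnd': 1519900012.5},
    {'scope': 'user.jdoe', 'filename': 'output.0002.root', 'dataset': 'user.jdoe.analysis_v1',
     'rse': 'GCS_EUROPE', 'filesize': 41_000_000,
     'transferStart': 1519900003.0, 'transferEnd': 1519900020.0},
]

# Popularity: number of accesses per dataset.
popularity = Counter(t['dataset'] for t in traces)

# Volume processed by the jobs, per storage endpoint, in GB.
volume_gb = Counter()
for t in traces:
    volume_gb[t['rse']] += t['filesize'] / 1e9

print(dict(popularity))   # {'user.jdoe.analysis_v1': 2}
print(dict(volume_gb))    # {'GCS_USEAST': 0.042, 'GCS_EUROPE': 0.041}
```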

Reading data from Google storage to Grid worker nodes - Streaming random-io

- It might be necessary to add the Google Cloud Network to LHCONE

Monitoring random-io

Deletion of data on Google Storage

- Allow the deletion of data on Google Cloud Storage using Rucio

Reading data inside Google data centres - Jobs running on Google compute

Network provisioning

- Ensure that the full network capacity is used for data ingress from grid storage to Google Cloud Storage
- Ensure that jobs running in Google Cloud Compute do not overwhelm our research networks

Transparent global redirection between inter-regional zones on Google Cloud Storage

- Retrieve a file from Google Cloud Storage using a unique identifier, regardless of which region/zone was used for the initial data ingress

Development of an economic cost model

- Control the cost of ATLAS data on Google Cloud Storage
- Control the cost of ATLAS jobs on Google Cloud Compute
A parameterised sketch of such a cost estimate follows below.
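A minimal sketch of how such a cost model could be parameterised, estimating a monthly cost from storage volume, network egress, and compute usage; the per-unit prices below are placeholders for illustration, not actual Google Cloud pricing:

```python
# Sketch only: a simple parameterised cost estimate for the PoC. All per-unit
# prices are placeholder assumptions, not actual Google Cloud pricing.
def monthly_cost_usd(storage_tb, egress_tb, vcpu_hours,
                     price_storage_per_tb=20.0,   # hypothetical $/TB-month stored
                     price_egress_per_tb=80.0,    # hypothetical $/TB of network egress
                     price_per_vcpu_hour=0.03):   # hypothetical $/vCPU-hour of compute
    """Estimate the monthly cost of keeping data and running jobs in the cloud."""
    return (storage_tb * price_storage_per_tb
            + egress_tb * price_egress_per_tb
            + vcpu_hours * price_per_vcpu_hour)

# Example: a 1 PB sample kept in Google Cloud Storage, 100 TB/month read back
# to grid sites, and 50'000 vCPU-hours of analysis jobs on Google Compute.
print(monthly_cost_usd(storage_tb=1000, egress_tb=100, vcpu_hours=50_000))
```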

Appendix 1 - Brainstorming document

Full resolution:
https://drive.google.com/open?id=1os6zn1c1n1xocyy-a_mmw71mdtqzsnks
https://drive.google.com/open?id=1ayylhvgiiv5h2_aeyqb-yoisdxeqw75m

Appendix 2 - Group photo

Left to right: Karan Bhatia, Alexei Klimentov, Horst Severini, Kaushik De, Thomas Beermann, Mario Lassnig, Sergey Panitkin, Ruslan Mashinistov, Martin Barisits, Fernando Barreiro, Matteo Turilli

Full resolution: https://drive.google.com/open?id=1uqutcaga0rboljavoty6ncohj0nyds13

Bibliography

[1] ATLAS Collaboration, G. Aad, et al. The ATLAS Experiment at the CERN Large Hadron Collider. J. Instrum., 3:S08003, 2008.
[2] LHC - The Large Hadron Collider. http://lhc.web.cern.ch/lhc/.
[3] Rucio
[4] T. Maeno, P. Nilsson, K. De, A. Klimentov, and T. Wenaus. PanDA Production and Analysis Backend. Journal of Physics: Conference Series, vol. 219, 2009.