WLCG Transfers Dashboard: a Unified Monitoring Tool for Heterogeneous Data Transfers.

J Andreeva¹, A Beche¹, S Belov², I Kadochnikov², P Saiz¹ and D Tuckett¹
¹ CERN (European Organization for Nuclear Research)
² JINR (Joint Institute for Nuclear Research)

Abstract. The Worldwide LHC Computing Grid provides resources for the four main virtual organizations. Along with data processing, data distribution is a key computing activity on the WLCG infrastructure. The scale of this activity is very large: the ATLAS virtual organization (VO) alone generates and distributes more than 40 PB of data in 100 million files per year. A further challenge is the heterogeneity of data transfer technologies. Currently there are two main alternatives for data transfers on the WLCG: the File Transfer Service and the XRootD protocol. Each LHC VO has its own monitoring system, which is limited to the scope of that particular VO. There is a need for a global system which provides a complete cross-VO and cross-technology picture of all WLCG data transfers. We present a unified monitoring tool, the WLCG Transfers Dashboard, where all the VOs and technologies coexist and are monitored together. The scale of the activity and the heterogeneity of the system raise a number of technical challenges: each technology comes with its own monitoring specificities, and some of the VOs use several of these technologies. This paper describes the implementation of the system, with particular focus on the design principles applied to ensure the necessary scalability and performance and to easily integrate any new technology, providing additional functionality which might be specific to that technology.

1. Introduction
The Worldwide LHC Computing Grid [1] (WLCG) is a global collaboration between 150 computing centres geographically distributed around the world. The LHC experiments generate 25 PB of new data every year and more than 200 PB of data are already stored. In the meantime, the computing models of the experiments are evolving from hierarchical data models with centrally managed data pre-placement towards federated storage. Today, the WLCG provides data transfer services for the four main virtual organizations (VOs in the following) using two key technologies: the File Transfer Service (FTS) [2] and XRootD [3]. Monitoring them is essential to obtain a complete picture of the data movements and to understand the ongoing traffic.

In this paper, the WLCG Transfers Dashboard is presented. It is a global system which provides a complete cross-VO and cross-technology picture of data transfer activity on the Grid. First, the initial architecture is presented. Then, the evolution of the requirements is described. Finally, the new architecture of the WLCG Transfers Dashboard, designed to meet these requirements, is discussed.

2. Initial needs and architecture
The WLCG is a complex environment shared by the four LHC VOs. In the early stage of the WLCG Transfers Dashboard, the focus was on FTS only, a technology shared by ATLAS, CMS and LHCb. While ALICE was missing, the picture given by monitoring FTS was a good approximation of the global traffic for the other VOs. This section focuses on the initial architecture of the WLCG Transfers Dashboard, as described by Figure 1.

Figure 1. Workflow of monitoring data.

Every FTS server has been instrumented to send a monitoring message each time a transfer starts or finishes (successfully or not). Initially, there were 13 FTS servers sending monitoring data, and this number is growing today along with the deployment of the new FTS3 service. To handle this monitoring dataflow, a reliable, yet modular and scalable, transport layer is required. The ActiveMQ [4] message broker has been chosen for its advanced features as well as its wide range of supported programming languages. While allowing close to real-time consumption of the messages, ActiveMQ can also act as a buffer in front of the database, so that messages can be consumed asynchronously, dramatically improving the reliability of the monitoring chain.

The raw monitoring data are then consumed by a collector which stores them persistently in a relational database. To improve the performance of the web application, the raw data are aggregated into statistics with different time granularities (10 minutes for queries requesting less than 2 days of data, 24 hours for the others). All database access goes through a well-defined API providing data in different human-readable formats: JSON and XML. The web API consists of a Python application deployed on top of an Apache web server and provides a comprehensive list of possible actions (backed by queries to the database), ensuring re-usability and consistency of the data between different applications. Finally, end users access the application through a web interface from which they can parameterize their queries and get the results in different representations such as a bar chart, a matrix or a world map.

This architecture has been partially inherited from the ATLAS DDM Dashboard [5] and is generic enough to be shared by other monitoring tools (today the same code is shared between 5 applications). However, the computing models of the experiments are evolving and new technologies are joining the landscape of data transfer services, bringing new requirements.
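To illustrate the collector and aggregation steps described above, the following is a minimal sketch rather than the production code: it assumes a JSON message format with hypothetical field names (src, dst, bytes, state, timestamp), uses SQLite as a stand-in for the actual relational database, and rolls raw transfer events up into the 10-minute statistics used for short time-range queries.

```python
import json
import sqlite3

# SQLite stands in for the production relational database (an assumption,
# not the actual backend used by the Dashboard).
db = sqlite3.connect("transfers.db")
db.execute("""CREATE TABLE IF NOT EXISTS raw_transfers (
                  src TEXT, dst TEXT, bytes INTEGER,
                  state TEXT, timestamp INTEGER)""")
db.execute("""CREATE TABLE IF NOT EXISTS stats_10min (
                  bin_start INTEGER, src TEXT, dst TEXT,
                  n_ok INTEGER, n_failed INTEGER, bytes INTEGER,
                  PRIMARY KEY (bin_start, src, dst))""")

def collect(message_body):
    """Store one raw monitoring message (hypothetical field names)."""
    m = json.loads(message_body)
    db.execute("INSERT INTO raw_transfers VALUES (?, ?, ?, ?, ?)",
               (m["src"], m["dst"], m["bytes"], m["state"], m["timestamp"]))
    db.commit()

def aggregate_10min():
    """Roll raw events up into 10-minute bins (600 s), as used for
    queries spanning less than two days of data."""
    db.execute("""
        INSERT OR REPLACE INTO stats_10min
        SELECT (timestamp / 600) * 600 AS bin_start, src, dst,
               SUM(state = 'DONE'), SUM(state != 'DONE'), SUM(bytes)
        FROM raw_transfers
        GROUP BY bin_start, src, dst""")
    db.commit()

# Example: one finished transfer reported by an FTS server.
collect(json.dumps({"src": "CERN-PROD", "dst": "BNL-ATLAS",
                    "bytes": 2_500_000_000, "state": "DONE",
                    "timestamp": 1385000000}))
aggregate_10min()
```

In the production system a second aggregation with 24-hour granularity serves queries covering longer time ranges.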

3. Evolution of the requirements
The first version of the WLCG Transfers Dashboard enabled monitoring of the data transfers handled by FTS. Three LHC experiments (ATLAS, CMS and LHCb) use FTS as the main tool for managing data transfers. ALICE relies on XRootD technology both for data transfers and data access. Therefore, the WLCG Transfers Dashboard could not provide a complete picture of WLCG transfers unless XRootD monitoring was integrated into the system. Moreover, other LHC experiments have shown interest in XRootD federations and have started to implement their own: the Federated ATLAS XRootD (FAX) project [7] and the Any data, Anytime, Anywhere (AAA) federation [6] for CMS. As shown in Figure 2, over the last month more than 20% of the overall WLCG traffic went through XRootD.

Figure 2. Breakdown of the traffic per technology based on the last month's data.

            FTS    XRootD   Volume moved
  ALICE      0%      100%   177 TB
  ATLAS     89%       11%   17 PB
  CMS       63%       37%   10.3 PB
  LHCb     100%        0%   495 TB

The increase in XRootD traffic has generated two new main requirements from the data traffic monitoring perspective. On the one hand, the administrators of an XRootD federation need a dedicated monitoring tool [8] with technology-specific features and the finest possible information granularity, including local and remote data access. On the other hand, all the WLCG traffic should be correlated in a single cross-VO and cross-technology user interface. To meet these requirements, the XRootD monitoring traffic is multicast downstream of the message broker, as shown in Figure 3; a minimal sketch of this fan-out is given at the end of this section.

Figure 3. Workflow of monitoring data.

At this point, data transfer monitoring consists of three distinct instances of monitoring services: the AAA and FAX Dashboards (for the CMS and ATLAS XRootD federations respectively) and the WLCG Transfers Dashboard for global traffic. Consequently, the XRootD monitoring data are duplicated at the database level, causing potential inconsistencies at the user interface level. The following section describes a new architecture in which a top level user interface retrieves and aggregates data from the underlying monitoring tools.
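The sketch below illustrates the fan-out of XRootD monitoring messages. It is not the production mechanism (which lives in the message broker configuration); in-memory queues stand in for the broker destinations, and the routing rule, based on a hypothetical "vo" field in each message, delivers a copy of every message to the global WLCG feed and another copy to the relevant federation-specific feed (FAX or AAA).

```python
import json
import queue

# In-memory queues stand in for the broker destinations consumed by each
# dashboard instance (an illustration, not the actual broker configuration).
destinations = {
    "wlcg": queue.Queue(),   # global WLCG Transfers Dashboard
    "fax":  queue.Queue(),   # ATLAS XRootD federation dashboard
    "aaa":  queue.Queue(),   # CMS XRootD federation dashboard
}

# Hypothetical mapping from the VO named in the message to its
# federation-specific dashboard.
FEDERATION_BY_VO = {"atlas": "fax", "cms": "aaa"}

def dispatch(message_body):
    """Duplicate one XRootD monitoring message downstream of the broker:
    every message feeds the global view, and ATLAS/CMS messages also feed
    their federation dashboard."""
    msg = json.loads(message_body)
    destinations["wlcg"].put(msg)
    federation = FEDERATION_BY_VO.get(msg.get("vo", "").lower())
    if federation:
        destinations[federation].put(msg)

# Example: an ATLAS XRootD transfer record (field names are assumptions).
dispatch(json.dumps({"vo": "atlas", "src": "MWT2", "dst": "AGLT2",
                     "bytes": 150_000_000}))
print(destinations["wlcg"].qsize(), destinations["fax"].qsize())  # -> 1 1
```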

4. Towards a federated dashboard
The previously described architecture is a good fit for technology-specific monitoring dashboards. However, its limitations are quickly reached in a heterogeneous, multi-technology environment. This is why WLCG Transfers monitoring has evolved from a single system with a central database to well-defined technology-specific applications aggregated together in a top level monitoring view.

Figure 4 shows the new database-less architecture adopted for this top level view. It consists of an application which makes HTTP calls to the other dashboards to retrieve the data and aggregates them together; a sketch of this retrieval and aggregation is given at the end of this section. The results are then displayed in a web user interface based on xbrowse [10], the Dashboard in-house MVC micro-framework, which is shared to a large extent with other data management monitoring applications. To improve performance, a caching layer based on Memcached [9] has been added to the web server in front of the database.

Figure 4. Federated Dashboard architecture.

There are numerous advantages to this new design. Firstly, all technologies and VOs can now be correlated, as shown in Figure 5; in addition, EOS traffic for ATLAS and CMS is now available in the global picture (it was not included in the previous WLCG Transfers Dashboard for scalability reasons). Secondly, consistency is guaranteed by design since the raw data are no longer duplicated. Finally, the integration of other technologies which will be used in the future, such as the HTTP/WebDAV Dynamic Federation [11], will be greatly simplified.

Figure 5. Correlation of the traffic per VO and technology.
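As an illustration of the database-less top level view, the sketch below fetches pre-aggregated statistics over HTTP from the underlying dashboards, merges them, and caches the merged result in Memcached. The endpoint URLs, the query parameters and the choice of the requests and python-memcached client libraries are assumptions made for the example, not the actual interfaces of the production services.

```python
import json
import memcache   # python-memcached client (choice of client is an assumption)
import requests

# Hypothetical endpoints of the underlying technology-specific dashboards.
SOURCES = {
    "fts":    "https://dashb-fts-transfers.example.cern.ch/api/matrix",
    "xrootd": "https://dashb-xrootd-transfers.example.cern.ch/api/matrix",
}

cache = memcache.Client(["127.0.0.1:11211"])
CACHE_TTL = 300  # seconds

def federated_matrix(start, end):
    """Return per-technology transfer statistics for the given time window,
    retrieved over HTTP from each underlying dashboard and cached as a whole."""
    key = "wlcg-matrix:%s:%s" % (start, end)
    cached = cache.get(key)
    if cached is not None:
        return cached

    merged = {}
    for technology, url in SOURCES.items():
        # Each dashboard exposes its own JSON API; parameter names are assumed.
        response = requests.get(url, params={"from": start, "to": end},
                                timeout=30)
        response.raise_for_status()
        merged[technology] = response.json()

    cache.set(key, merged, time=CACHE_TTL)
    return merged

if __name__ == "__main__":
    stats = federated_matrix("2014-01-01", "2014-01-02")
    print(json.dumps(stats, indent=2))
```

Because the top level view holds no database of its own, caching the merged responses is what keeps the interactive interface responsive while leaving each underlying dashboard as the single source of its raw data.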

5. Conclusion
The WLCG Transfers Dashboard is a cross-VO and cross-technology data transfer monitoring system used daily by more than 80 distinct users. Its architecture has evolved from a single system with a central database to a federated model combining technology-specific instances of monitoring services. The new architecture provides better scalability and performance; for example, it allows the integration of EOS data (which are generated at a peak rate of 800 Hz) within the global picture. It also provides a clean isolation of the workflows and facilitates the integration of future data transfer technologies.

Acknowledgement
We would like to thank our colleagues Domenico Giordano, Michail Salichos, Matevz Tadel and Ilija Vukotic, who greatly contributed by enabling the monitoring data flows for FTS and XRootD.

References
[1] WLCG web page: http://wlcg.web.cern.ch/
[2] FTS3 web page: https://svnweb.cern.ch/trac/fts3
[3] XRootD web page: http://www.xrootd.org/
[4] ActiveMQ web page: http://activemq.apache.org/
[5] DDM Dashboard: http://dashb-atlas-data.cern.ch
[6] AAA project page: https://twiki.cern.ch/twiki/bin/view/main/cmsxrootdarchitecture
[7] FAX project page: https://twiki.cern.ch/twiki/bin/view/atlascomputing/atlasxrootdsystems
[8] Monitoring of large-scale federated data storage: XRootD and beyond, poster at CHEP 2013
[9] Memcached web page: http://memcached.org/
[10] Andreeva J et al 2012 Designing and developing portable large-scale JavaScript web applications within the Experiment Dashboard framework J. Phys.: Conf. Ser. 396 052069
[11] Furano F, Brito da Rocha R, Devresse A, Keeble O, Alvarez Ayllon A and Fuhrmann P 2013 The Dynamic Federations: Federate Storage on the fly using HTTP/WebDAV and DMLite The International Symposium on Grids and Clouds