GRID Storage Optimization in Transparent and User-Friendly Way for LHCb Datasets

arXiv:1705.04513v1 [cs.DC] 12 May 2017

M Hushchyn 1,2, A Ustyuzhanin 1,3, P Charpentier 4 and C Haen 4

1 Yandex School of Data Analysis, Moscow, Russia
2 Moscow Institute of Physics and Technology, Dolgoprudny, Russia
3 National Research University Higher School of Economics, Moscow, Russia
4 CERN, Geneva, Switzerland

E-mail: mikhail.hushchyn@cern.ch

Abstract. The LHCb collaboration is one of the four major experiments at the Large Hadron Collider at CERN. Many petabytes of data are produced by the detectors and Monte-Carlo simulations. The LHCb Grid interware LHCbDIRAC [1] is used to make data available to all collaboration members around the world. The data is replicated to Grid sites in different locations. However, the Grid disk storage is limited and does not allow keeping replicas of each file at all sites. It is therefore essential to optimize the number of replicas to achieve better Grid performance. In this study, we present a new approach to data replication and distribution based on data popularity prediction. The popularity prediction is based on the data access history and metadata, and uses machine learning techniques and time series analysis methods.

1. Introduction

A data replication strategy is essential for grid performance. On the one hand, too few replicas put a high load on the network and lead to non-optimal usage of the computing resources and long job wall-clock times. On the other hand, too many replicas require a lot of disk space at grid sites to keep all these datasets. CERN experiments have explored how to predict data popularity using machine learning and how to use it in the data replication strategy. ATLAS uses Artificial Neural Networks (ANN) to predict the number of accesses to a dataset in the next week [2]. Access frequency represents the popularity of the dataset.
Then, once a week, one replica is removed for each unpopular dataset and a new replica is created for each popular dataset. CMS predicts dataset popularity for the upcoming week using classifiers. In that study, several popularity definitions were explored [3], and different classifiers were used to predict the popularity.

The LHCb data management system (DMS) [4] uses a static replication strategy: a dataset has a given number of replicas during its lifetime. This number can be changed by the data manager based on knowledge of the storage occupancy and the need for free space. However, this is not always done in an optimal way. The first study of machine learning for data replication optimization in LHCb [5] was done in 2015. The approach presented in this paper is an improved and redesigned evolution of that study. In the new model, both long-term and short-term data popularity predictions are considered. The predictions help to detect when the number of replicas of a dataset can be increased or decreased.

Figure 1. General concept of the study.

2. General Concept

LHCb has more than 15 PB of data on disk. Driven by the LHCb DMS [4], each dataset is replicated on disk and also has one archival replica on tape. The model presented here is designed in the manner of cache algorithms [6]. These algorithms use specific metrics to select the most appropriate element to be removed from the cache. For example, the Least Recently Used (LRU) algorithm [6] removes the least recently used element from the cache, whereas the Least Frequently Used (LFU) algorithm [6] removes the least frequently used one. The particular metrics used in our model are considered in Sections 3 and 5.

The study is based on the dataset access history for the last 2.5 years and on the dataset metadata. The general schema of the model is shown in Figure 1. The model provides long-term and short-term predictions of the data popularity. First, machine learning is used to predict the probability that a dataset will be accessed in the long-term future, i.e. in the next 6 months or a year. Then, time series analysis is used to predict the number of accesses to the dataset in the short-term future: one week or one month. Finally, the metrics for the replication strategy are calculated from these predictions. These metrics determine which replica or whole dataset should be removed from disk first.

3. Long-term Prediction

Each dataset is described by the following features: recency, reuse distance (i.e. the time between the last and the second-to-last usages), time of the first access, creation time, access frequency, type, extension and size. Some of them are taken from the dataset metadata, others are extracted from the access history. These features are used to predict the probability that the dataset will be accessed during the next 6 months. The probability is predicted using the Random Forest Classifier.
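The history-derived features listed above can be sketched as follows. This is only an illustration: the function name `access_features`, the choice of days as the unit, and the exact definitions (for example, measuring the time of the first access relative to the creation date) are assumptions, not the definitions used in the study.

```python
from datetime import date


def access_features(access_dates, created, today):
    """Compute history-based features for one dataset, in days.

    All definitions here are illustrative assumptions.
    """
    dates = sorted(access_dates)
    n = len(dates)
    age_days = (today - created).days
    return {
        # Recency: days since the most recent access (None if never accessed).
        "recency": (today - dates[-1]).days if n >= 1 else None,
        # Reuse distance: days between the last and second-to-last accesses.
        "reuse_distance": (dates[-1] - dates[-2]).days if n >= 2 else None,
        # Time of the first access, measured from the creation date.
        "first_access": (dates[0] - created).days if n >= 1 else None,
        # Access frequency: accesses per day since creation.
        "frequency": n / age_days if age_days > 0 else 0.0,
    }


features = access_features(
    [date(2016, 1, 10), date(2016, 3, 1), date(2016, 3, 11)],
    created=date(2016, 1, 1),
    today=date(2016, 4, 10),
)
```

Metadata features such as type, extension and size would be added to this dictionary directly from the dataset catalogue.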
The classifier training includes two steps. In the first step, the dataset feature values are computed from the first 1.5 years of the last 2.5 years of the access history. In the second step, the following 6 months of the history are used to label the datasets: a dataset is labeled as popular if it is accessed during these 6 months, and as unpopular otherwise. The classifier is then trained to separate the popular datasets from the unpopular ones using these features.
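The two-step split described above can be sketched in a few lines. The function name `split_and_label`, the dictionary layout of the history, and the 182-day approximation of 6 months are assumptions made for this illustration; the classifier fit itself is omitted.

```python
from datetime import date, timedelta


def split_and_label(history, feature_end, label_days=182):
    """Split each dataset's accesses into a feature window and a label window.

    history: dict mapping dataset name -> list of access dates.
    Returns dict mapping name -> (accesses up to feature_end, label),
    where label is 1 (popular) if the dataset is accessed in the
    label window and 0 (unpopular) otherwise.
    """
    label_end = feature_end + timedelta(days=label_days)
    result = {}
    for name, accesses in history.items():
        # Step 1: only these accesses may be used to compute features.
        past = sorted(d for d in accesses if d <= feature_end)
        # Step 2: any access in the following window makes it "popular".
        accessed_later = any(feature_end < d <= label_end for d in accesses)
        result[name] = (past, 1 if accessed_later else 0)
    return result


history = {
    "dataset_a": [date(2015, 3, 1), date(2016, 8, 15)],
    "dataset_b": [date(2015, 5, 20)],
}
labeled = split_and_label(history, feature_end=date(2016, 6, 30))
```

Keeping the feature window strictly before the label window is what prevents information from the future leaking into the training features.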

The classifier quality is evaluated in the following way. New feature values of the datasets are computed from the first 2 years of the last 2.5 years of the access history. From these values, the classifier predicts the probability that a dataset will be popular. Then, the true dataset labels are extracted from the next 6 months of the access history. The predictions and the true labels are used to measure the classifier quality. Finally, the classifier is retrained using the first 2 years of the history, and feature values computed from the entire history are used to predict the access probabilities for the next 6 months of the (unknown) future, in the way described above.

Each dataset feature has a different importance for this two-class separation. The Random Forest Classifier makes it possible to evaluate this contribution. Figure 2 shows the feature importances as estimated by the classifier. The figure shows that recency is the most important feature in separating popular datasets from unpopular ones. The LRU algorithm is therefore chosen as a reference for comparison with the classifier.

Figure 2. Dataset feature importances.

The predicted probability can be used as a metric for saving disk space. When space must be freed, the dataset with the smallest access probability is removed first. A mistake occurs when a dataset is removed but is accessed later. In this case, the dataset has to be restored to disk from its tape archive. The quality of the approach is shown in Figure 3. The figure demonstrates that the classifier saves the same amount of disk space with up to 2 times fewer mistakes than the LRU method.

Figure 3. Dependence of the amount of saved disk space on the level of mistakes.

4. Short-term Forecast

In this part of the study, time series analysis is used to predict the number of accesses to a dataset during the coming 4 weeks. Brown's simple exponential smoothing model is chosen for this purpose.
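The model is defined as:

$$\hat{y}_{t+1} = \hat{y}_t + \alpha\,(y_t - \hat{y}_t) \tag{1}$$

$$\alpha = \arg\min_{\alpha} \sum_t (\hat{y}_t - y_t)^2, \quad \alpha \in [0, 1] \tag{2}$$

where:

$$\hat{y}_0 = \frac{1}{n} \sum_{t=0}^{n} y_t \tag{3}$$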
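Equations (1)-(3) can be sketched in plain Python as follows. The function names are hypothetical, the initial value is taken as the plain mean of the series, and the grid search over the smoothing coefficient is an assumption: the source does not say how the minimization in equation (2) is carried out.

```python
def smooth_forecast(y, alpha):
    """Return (one-step predictions for every y[t], forecast for the next step)."""
    y_hat = sum(y) / len(y)  # eq. (3): initial value, taken as the series mean
    predictions = []
    for value in y:
        predictions.append(y_hat)
        y_hat = y_hat + alpha * (value - y_hat)  # eq. (1)
    return predictions, y_hat


def fit_alpha(y, steps=100):
    """Pick alpha in [0, 1] minimizing the squared one-step errors, eq. (2).

    A simple grid search is assumed here for illustration.
    """
    def squared_error(alpha):
        predictions, _ = smooth_forecast(y, alpha)
        return sum((p - v) ** 2 for p, v in zip(predictions, y))

    return min((i / steps for i in range(steps + 1)), key=squared_error)


weekly_accesses = [2, 4, 6]
_, average_forecast = smooth_forecast(weekly_accesses, alpha=0.0)
_, static_forecast = smooth_forecast(weekly_accesses, alpha=1.0)
best_alpha = fit_alpha([1, 5, 2, 8, 3])
```

With a coefficient of 0 the forecast stays at the series mean, and with a coefficient of 1 it equals the last observed value.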

In equations (1)-(3):

$\hat{y}_{t+1}$ is the predicted number of accesses at time-step $t + 1$
$y_t$ is the number of accesses at time-step $t$
$\hat{y}_t$ is the predicted number of accesses at time-step $t$
$\alpha$ is a smoothing coefficient
$\hat{y}_0$ is the initial value of the prediction

The model has two extreme cases. The first one is an average model, which corresponds to $\alpha = 0$. In this case, the prediction at the next time-step is the average of the values at all previous time-steps. When $\alpha = 1$, the model degenerates into a static one: the predicted value at the next time-step equals the value at the current time-step. The actual Brown's model lies somewhere between the average and the static extremes. The model has only one adjustable parameter. As a result, it works well with short and sparse time series. Moreover, Brown's model is simple to interpret.

Figure 4 shows the quality of the model on the data used. The predictions and the corresponding true values have a high correlation of 0.92. The correlation for the static model is 0.86.

Figure 4. Short-term forecast quality.

5. Replication Strategy

Based on the results of the short-term forecast, the following metric is computed for each dataset:

$$M = \frac{\hat{y}_{t_{curr}+1}}{n_{replicas}} \tag{4}$$

where:

$\hat{y}_{t_{curr}+1}$ is the predicted number of accesses at the next time-step
$n_{replicas}$ is the number of replicas of the dataset

This metric is used to adjust the number of dataset replicas. When disk space has to be freed, a replica of the dataset with the minimum metric value is removed first. The very last replica of a dataset is not removed and stays on disk. After each deletion, the value of M is recalculated and the procedure is repeated. When free disk space is available, a new replica of the dataset with the maximum metric value is added first. Then, the metric value is recalculated and the procedure is repeated.

Figure 5 shows the cumulative distribution of occupied disk space in LHCb for different numbers of replicas. The function shows that the largest part of the space corresponds to datasets with 3 and 4 replicas. Thus, reducing the number of replicas for them saves petabytes of disk space when it is needed.

Figure 5. Cumulative distribution function of occupied disk space in LHCb as a function of the number of dataset replicas.

In general, the strategy described above uses the disk space more optimally. However, once or twice per year it makes sense to also remove the very last replicas of the datasets that are the most unpopular according to the long-term predicted probabilities described in Section 3. This saves the disk space occupied by datasets that most likely will never be accessed in the future.

6. Conclusion

The paper presents an approach that uses machine learning and time series analysis in the data replication strategy. Forecasting the number of accesses to datasets makes it possible to optimize the number of replicas. Predicting the probability that a dataset will be used in the long-term future helps to remove the least popular data from disk. These methods outperform conventional ones such as the static model and LRU, and help to use disk space more optimally with a lower risk of mistakes and data loss.

7. Acknowledgments

The authors are grateful to Denis Derkach and Fedor Ratnikov for their helpful comments during the work on this study and for their help with preparing this paper.

References

[1] Stagni F et al. 2012 LHCbDirac: distributed computing in LHCb J. Phys.: Conf. Series 396 032104 (IOP Publishing)
[2] Beermann T, Stewart G A and Maettig P 2014 A Popularity-Based Prediction and Data Redistribution Tool for ATLAS Distributed Data Management The International Symposium on Grids and Clouds (ISGC) 2014
[3] Kuznetsov V, Li T, Giommi L, Bonacorsi D and Wildish T 2016 Predicting dataset popularity for the CMS experiment J. Phys.: Conf. Series 762 012048 (IOP Publishing)
[4] Baud J P, Charpentier P, Ciba K, Graciani E, Lanciotti E, Máthé Z, Remenska D and Santana R 2012 The LHCb Data Management System J. Phys.: Conf. Series 396 032023
[5] Hushchyn M, Charpentier P and Ustyuzhanin A 2015 Disk storage management for LHCb based on Data Popularity estimator J. Phys.: Conf. Series 664 042026 (IOP Publishing)
[6] Saemundsson T 2012 An experimental comparison of cache algorithms http://en.ru.is/media/skjol-td/cache_comparison.pdf
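The replica-removal procedure of Section 5 can be sketched as a greedy loop over the metric M (predicted accesses divided by the number of replicas). The function name `free_replicas` and the data layout are hypothetical, and the sketch counts removed replicas rather than freed bytes, which is a simplification that ignores dataset sizes.

```python
def free_replicas(datasets, replicas_to_remove):
    """Greedily remove replicas from the datasets with the smallest metric M.

    datasets: dict mapping name -> {"forecast": float, "replicas": int}.
    Returns (updated copy of datasets, list of names in removal order).
    """
    state = {name: dict(info) for name, info in datasets.items()}
    removed = []
    for _ in range(replicas_to_remove):
        # The very last replica of a dataset is never removed.
        candidates = [n for n, d in state.items() if d["replicas"] > 1]
        if not candidates:
            break
        # M is recomputed on every pass because replica counts change.
        victim = min(
            candidates,
            key=lambda n: state[n]["forecast"] / state[n]["replicas"],
        )
        state[victim]["replicas"] -= 1
        removed.append(victim)
    return state, removed


datasets = {
    "a": {"forecast": 10.0, "replicas": 4},
    "b": {"forecast": 1.0, "replicas": 3},
    "c": {"forecast": 0.0, "replicas": 1},
}
new_state, removal_order = free_replicas(datasets, replicas_to_remove=3)
```

Adding replicas when space is available would follow the same pattern with `max` in place of `min` and an increment instead of a decrement.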