Forecasting Model for Network Throughput of Remote Data Access in Computing Grids

Volodimir Begy (CERN, Geneva, Switzerland), Martin Barisits (CERN, Geneva, Switzerland), Mario Lassnig (CERN, Geneva, Switzerland), Erich Schikuta (University of Vienna, Vienna, Austria)

ATL-SOFT-PROC /09/2018

Copyright 2018 CERN for the benefit of the ATLAS Collaboration. Reproduction of this article or parts of it is allowed as specified in the CC-BY-4.0 license.

Abstract: Computing grids are one of the key enablers of eScience. Researchers from many fields (e.g. High Energy Physics, Bioinformatics or Climatology) employ grids to run computational jobs in a highly distributed manner. The current state-of-the-art approach for data access in the grid is data placement: a job is scheduled to run at a specific data center, and its execution starts only when the complete input data has been transferred there. This approach has two major disadvantages: (1) the jobs stay idle while waiting for the input data; (2) due to the limited infrastructure resources, the distributed data management system handling the data placement may queue the transfers for up to several days. An alternative approach is remote data access: a job may stream the input data directly from storage elements, which may be located at local or remote data centers. Remote data access brings two innovative benefits: (1) the jobs can be executed asynchronously with respect to the data transfer; (2) when combined with data placement on the policy level, it may help to optimize the network load grid-wide, since these two data access methodologies partially exhibit non-overlapping bottlenecks. However, in order to employ such a technique systematically, the properties of its network throughput need to be studied carefully. This paper presents the results of an experimental identification of parameters influencing the throughput of remote data access, a statistically tested mathematical formalization of these parameters, and, finally, an Auto-Regressive Integrated Moving Average (ARIMA) based forecasting model. In the context of a given virtual link between a storage element and a worker node, the model predicts the future relationship between the amount of data to be transferred and the amount of concurrent traffic as the independent variables and the transfer time as the dependent variable. The purpose of the model is to assist various stakeholders of the grid in decision-making related to data access. A benchmark of the forecasting performance of Long Short-Term Memory Artificial Neural Networks (LSTM ANN) is also presented. This work is based on measurements taken on the World-Wide LHC Computing Grid (WLCG), which is employed by the ATLAS experiment at CERN in the domain of High Energy Physics.

Index Terms: Grid Computing, Remote Data Access, Network Modeling, Time Series Forecasting, Machine Learning

I. MOTIVATION

Computing grids [1] are collections of geographically sparse data centers. Employing a wide range of heterogeneous hardware, the participating facilities work together to reach a common goal in a coordinated manner. Grids store vast amounts of scientific data, and numerous users run computational jobs to analyze these data in a highly distributed fashion. For example, within the World-Wide LHC Computing Grid (WLCG) [2] more than 150 computing sites are employed by the ATLAS experiment [3] at CERN. WLCG stores more than 360 petabytes of ATLAS data, which is used for distributed analysis by more than 5000 users.
In order to achieve a modular architecture, the grid resources are typically divided into three major classes: storage elements, worker nodes and the network. The function of storage elements is to replicate and persist the data for the long term using various technologies (e.g. hard disks, solid state drives or tapes). The worker nodes, on the other hand, perform CPU-intensive tasks. WAN and LAN networks facilitate the data transfer between storage elements and worker nodes. This strict division of labour has led to the current best practice for data access in the grid being data placement [4]. Data placement is performed by dedicated distributed data management systems. Consider a job scheduled to run at the data center DC1, while the required input replica is located at the data center DC2. The job may commence its execution only after the completion of the following workflow: (1) the input replica is transferred from the relevant storage element at DC2 to a storage element at DC1; (2) the input replica is staged in from the relevant storage element at DC1 to the worker node. An alternative approach is remote data access, which allows a job to stream the input replica directly from a local or remote storage element. To demonstrate the potential of remote data access to optimize resource utilization when combined with traditional data placement on the policy level, consider the following two scenarios. The first example is depicted in Fig. 1. A computational job is submitted to the worker node at the data center DC1. The job requires the input replica R, which is persisted in the remote storage element at the data center DC2. Currently, the storage element in the local data center DC1 has a high load because of transfers to various other data centers. Assume that the bottleneck is only present at this local storage element. This is recognized by the distributed data management system, and it queues the transfer of the input replica R. Employing remote data access, the bottleneck can be bypassed immediately. In another example, the input replica is located at the local storage element. However, the worker node does not have enough disk space available for stage-in.

Fig. 1. Bottleneck at the local storage element.

The distributed data management system queues the file transfer until the required disk space becomes available. In contrast, using remote data access to the local storage element, the input data can be streamed in fractions without persisting the complete input file. Taking into account the magnitude of real-life computing grids, it becomes obvious that many more potential scenarios would benefit from remote data access in the context of resource utilization. This has motivated us to study the properties of its network throughput. Furthermore, a forecasting model for such network throughput is needed for the planning and coordination of scientific workflows.

II. STATE OF THE ART

The state-of-the-art approach for data access in computing grids is data placement. An example of a distributed data management system facilitating inter-storage-element data placement and stage-in is Rucio [5]. Besides data placement, Rucio is also responsible for data cataloging and replication. In its deployment by the ATLAS experiment at CERN, Rucio is coupled with the File Transfer Service (FTS) and the workload management system PanDA [6]. However, the modular architecture allows for a flexible integration with other grid frameworks. Due to its horizontally scalable architecture and high portability, it currently handles about 2 million file transfers per day, equivalent to 2 petabytes, among more than 150 data centers. Another example of a data placement framework is Stork [7]. The authors propose to strictly differentiate between data placement and computational activities within the workflows, enabling their asynchronous execution. Stork is interoperable with workflow management systems (e.g. DAGMan [8]), supports various data transfer protocols (e.g. HTTP, GridFTP [9]) and allows the load on storage elements and network links to be controlled. There is an ongoing effort at CERN to enable grid-wide remote data access in the WLCG based on the WebDAV protocol [10]-[12]. Currently, it is deployed only at a fraction of the data centers, allowing analysis tasks to access local storage elements. An open issue is the thorough monitoring and logging of such file accesses. Our research on remote data access network throughput is intended to help enable more advanced data access policies. Similar research on network throughput forecasting is presented in [13]. The authors present a methodology for the derivation of short-term predictions based on regression. The effect of concurrent traffic is not explicitly investigated in that work, and the experiments are performed with data probes of up to 16 megabytes. In contrast, our model delivers long-term predictions, taking into account the concurrent traffic of parallel threads and processes, and our work is based on experiments with concurrent file transfers totaling up to tens of gigabytes.

III. PARAMETERS INFLUENCING REMOTE DATA ACCESS THROUGHPUT

The first phase of this study is the experimental identification of the crucial parameters impacting remote data access throughput. In our experiment the throughput is represented indirectly through the transfer time for a given file size. Note that these quantities can be easily converted into one another. The hardware of the communicating hosts is a black box; the properties of the hosts are represented exclusively by the observed network throughput. A single computational job is equivalent to a process. Within this process, each requested input file is transferred by an individual thread.
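Since the study works with transfer times while throughput is often the quantity of interest, the conversion between the two is direct. A minimal sketch follows; the function names and units are illustrative, not taken from the paper.

```python
# Units are illustrative: file size in megabytes, transfer time in seconds.

def throughput_mb_per_s(size_mb: float, transfer_time_s: float) -> float:
    """Average throughput observed for a single transfer."""
    return size_mb / transfer_time_s

def transfer_time_for(size_mb: float, rate_mb_per_s: float) -> float:
    """Expected transfer time of a file at a given average throughput."""
    return size_mb / rate_mb_per_s
```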
The null hypothesis states that for a computational job streaming a given remote file, the transfer time T cannot be linearly regressed onto the following independent variables: (1) S, the size of the original file; (2) ConTh, the cumulative concurrent traffic originating from other threads in the same process during the streaming of the original file; (3) ConPr, the cumulative concurrent traffic originating from other processes on the same worker node during the streaming of the original file. The alternative hypothesis states that such a regression represents the true relationship between the mentioned independent and dependent variables. The independent variable ConPr stands for traffic which is created as part of the study. Obviously, other concurrent traffic is present on the investigated links, but it is not feasible to monitor all the traffic in the grid. Thus, only the additional effect imposed on the grid by the given bag of jobs is measured. It will be shown below how other dynamic aspects of the grid are implicitly encapsulated in the forecasting model. To sample the data, 12 computational jobs are launched on a single worker node at the CERN data center (Switzerland). In 15-minute steps the processes initiate remote accesses to the storage element GRIF-LPNHE SCRATCHDISK at the GRIF-LPNHE data center (France) in the following order:
1) Process 1: launch 1 thread to stream 300MB

2) Process 1: launch 1 thread to stream 1GB
3) Process 1: launch 1 thread to stream 3GB
4) Processes 1-4: each launch 1 thread to stream 300MB per thread
5) Process 1: launch 2 threads to stream 300MB per thread
6) Processes 1-4: each launch 1 thread to stream 1GB per thread
7) Process 1: launch 2 threads to stream 1GB per thread
8) Processes 1-4: each launch 1 thread to stream 3GB per thread
9) Process 1: launch 2 threads to stream 3GB per thread
10) Processes 1-12: each launch 1 thread to stream 300MB per thread
11) Process 1: launch 4 threads to stream 300MB per thread
12) Processes 1-12: each launch 1 thread to stream 1GB per thread
13) Process 1: launch 4 threads to stream 1GB per thread
14) Processes 1-12: each launch 1 thread to stream 3GB per thread
15) Process 1: launch 4 threads to stream 3GB per thread

Each launched thread is treated as an observation in the final dataset. The drawing interval of 15 minutes was chosen to reduce the probability of overlapping traffic among neighboring drawing steps; in the end, such overlaps did not occur. In each observation, the variable S takes on one of the predefined values (meaning that all transfers terminated successfully). Based on the experimental design it is known which concurrent threads and processes belong to which drawing step. Over a sampling period of more than 2.5 days we have repeatedly drawn the transfer time T for all the mentioned kinds of threads. Given a single observation x, its value for the variable ConTh is approximated in the following way. First, note down the drawn transfer time T of the observation x. Next, note down the set of all concurrent threads originating from the same process in the same drawing step; denote this set by xth. Next, iterate over all the elements of xth, denoting the current element by y. If the drawn transfer time T of the thread x is greater than or equal to the drawn transfer time T of the thread y, then add the predefined value of S of y to the ConTh of x. Otherwise, calculate the fraction f by which the transfer times overlap and add f * (S of y) to the ConTh of x. The value of the variable ConPr is approximated by an analogous algorithm, based on the drawn transfer times of concurrent processes. After such sampling and transformations we derive 1122 observations, each having values for the variables T, S, ConTh and ConPr. The protocols used for remote data access are WebDAV/HTTP [14]. Note that such an intrusive sampling design was necessary, because currently these metrics are not logged. Once proper remote data access monitoring is implemented grid-wide, the observations can be mined from logs unintrusively. A multiple linear regression is fit to the sampled data using ordinary least squares, resulting in an equation of the form

T = a * S + b * ConTh + c * ConPr   (1)

where a, b and c are the fitted coefficients. The fit has an F-statistic of 4.549e+04 on 3 degrees of freedom and 1119 residual degrees of freedom, and a p-value of <2.2e-16. Thus, the null hypothesis is rejected. The coefficient of S shows that files of various sizes are transferred with various throughputs. Also note that the coefficient of ConTh is about 34.4 times greater than the coefficient of ConPr. This means that concurrent traffic among threads within one process is penalized to a much higher extent in terms of throughput than concurrent traffic among different processes. To check for interaction between the variables ConTh and ConPr, two reduced multiple linear regression models are fit on accordingly adjusted datasets using ordinary least squares.
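As an illustration of the sampling transformation and the regression just described, the following minimal Python sketch computes ConTh via the overlap rule and indicates how the no-intercept fit of Eq. (1) could be reproduced. The helper name, the example numbers and the use of pandas/statsmodels are assumptions made for illustration, not the tooling used in the paper.

```python
# Each observation is a dict with the predefined file size S (in MB) and the
# drawn transfer time T (in seconds); threads of one drawing step are assumed
# to start simultaneously.
import pandas as pd
import statsmodels.formula.api as smf

def approx_con_th(obs, peers):
    """Approximate the cumulative concurrent-thread traffic for one observation."""
    total = 0.0
    for y in peers:
        if obs["T"] >= y["T"]:
            total += y["S"]                        # peer finished first: its full size counts
        else:
            total += (obs["T"] / y["T"]) * y["S"]  # only the overlapping fraction counts
    return total

# Hypothetical drawing step: two threads of one process, no concurrent processes.
threads = [{"S": 2000.0, "T": 21.5}, {"S": 300.0, "T": 5.6}]
rows = [{"T": x["T"],
         "S": x["S"],
         "ConTh": approx_con_th(x, [y for y in threads if y is not x]),
         "ConPr": 0.0}
        for x in threads]
observations = pd.DataFrame(rows)

# On the full table of 1122 observations, the no-intercept fit of Eq. (1) could
# be obtained with ordinary least squares, e.g.:
# fit = smf.ols("T ~ 0 + S + ConTh + ConPr", data=observations).fit()
# print(fit.params, fit.fvalue, fit.f_pvalue)
```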
The first reduced model is fit on observations for which ConPr is equal to 0; the second reduced model is fit on observations for which ConTh is equal to 0. The resulting models are shown below:

T = a * S + b * ConTh   (2)

which is statistically significant, having an F-statistic of 3.046e+04 on 2 degrees of freedom and 340 residual degrees of freedom and a p-value of <2.2e-16, and

T = a * S + c * ConPr   (3)

which is statistically significant with an F-statistic of 1.978e+04 on 2 degrees of freedom and 832 residual degrees of freedom and a p-value of <2.2e-16. These fits are depicted in Fig. 2 and 3. Note how the coefficient of ConTh has changed in the reduced model by only -1.7% in relation to the full model. The coefficient of ConPr has changed by 2.5%. Among all 3 regression fits, the coefficient of S has changed by at most 9.7% in absolute value. This is a hint that in the true relationship there is no interaction between the variables ConTh and ConPr. Also note that Fig. 2 and 3 exhibit a few outliers and a certain amount of spread along the axis of the dependent variable for observations that are similar in terms of the independent variables. This is to be expected, because the data was sampled over a period of more than 2.5 days, and computing grids are highly dynamic systems. It will be shown below how our forecasting approach handles this dynamic behavior. It is evident from the figures that once similar observations are averaged over the sampling period, the linear fit remains a good approximation of the true relationship. Thus, this finding generalizes well.

IV. THROUGHPUT FORECASTING MODEL

Our forecasting approach is based on the demonstrated fact that the transfer time T can be regressed onto the variables S, ConTh and ConPr. Consider the time series plot in Fig. 4,

which shows transfer times for 2 threads, transferring 2 gigabytes and 300 megabytes respectively, within one computing job over the sampled time window. The observations are drawn once every 30 minutes. The computing job runs in the worker node pool ANALY AUSTRALIA and streams the data from the local storage element AUSTRALIA-ATLAS SCRATCHDISK. Once per day, the job gets resubmitted to a new worker node in the pool, which is selected by the workload management system. This is done to emulate the real-world workflow during job submission.

Fig. 2. Regression fit T~0+S+ConTh.
Fig. 3. Regression fit T~0+S+ConPr.
Fig. 4. Australia: transfer times.

It is obvious from the daily shifts along the y-axis that various worker nodes exhibit either very different or very similar throughputs. Consider, e.g., the throughput during the first and the seventh drawing days (here, 1 day is equal to 48 points). Even though there are multiple days between these intervals, and even though the transfers run on two different machines during these days (00:1B:21:8F:1F:44 and 00:1B:21:90:0C:60), the throughput is nearly constant. We assume that the large shifts are caused by factors such as significantly different hardware or network loads. For more advanced analytics, worker nodes with similar throughputs can be clustered together, and based on the number of occurrences of each cluster in a time window of interest, a probability distribution can be derived. This probability distribution can be useful for scheduling, since it gives an insight into what fractions of jobs in a computational campaign are likely to be submitted to which families of worker nodes. Let a, b and c denote the coefficients of the regression

T ~ 0 + a * S + b * ConTh + c * ConPr   (4)

From Fig. 4 it is evident that the values of the coefficients vary over time. The sampled multivariate time series can be transformed into a multivariate time series of the coefficients a and b. The reduced regression is written as

T ~ 0 + a * S + b * ConTh   (5)

Let T1 and T2 be the sampled transfer times for the 2GB and 300MB files respectively. Since the transfer of the 300MB file always terminates first in our data, the coefficients a and b can be calculated for each time step using the following system of equations:

T1 = 2000 * a + 300 * b
T2 = 300 * a + (2000 * (T2 / T1)) * b   (6)

After substitution we get:

a = (17 * T1 * T2) / (100 * (400 * T2 - 9 * T1))   (7)

and

b = (T1 - 2000 * a) / 300   (8)

Note that we did not sample data which would allow us to reconstruct the time series for the coefficient c. This is due to the previously demonstrated fact that a vast amount of concurrent traffic among processes is required in order to reach a noticeable effect; such sampling would be highly intrusive. Also note that the complete multivariate time series for the coefficients a, b and c can be mined unintrusively from logs, once proper remote data access monitoring is implemented. We have transformed the sampled time series of transfer times from Fig. 4 into time series of the coefficients a and b. Next, for forecasting purposes, we have normalized it with respect to the daily shifts among clusters of worker nodes. This normalization can be reversed. The rest of the variance in the normalized data comes from factors which are not exactly known due to the high complexity and dynamic behavior of the grid. We assume that the remaining variance is determined by the unexplained behavior of the storage elements and network links. Finally, we have simulated the variate for the coefficient c, based on the previous demonstration that in the case of remote data access from the CERN to the GRIF-LPNHE data center, its value on average was 34.4 times smaller than that of the coefficient b. The resulting time series is presented in Fig. 5.

Fig. 5. Australia: normalized coefficients.

We have also performed an analogous sampling and analysis for the storage element - worker node pool pair BNL-OSG2 SCRATCHDISK - ANALY BNL LONG. The transfer time and coefficient time series are shown in Fig. 6 and 7.

Fig. 6. BNL: transfer times.

Note that in infrequent instances in Fig. 7 the values of the coefficients are negative. This is due to the fact that the time series is normalized with respect to the first drawing day. Obviously, these observations can be transformed or cleaned in accordance with the specific analysis requirements. The demonstrated multivariate coefficient time series encapsulates all the unknown factors which are associated with the high complexity and dynamic behavior of the grid. Using time series forecasting methods, its future values can be derived. Thus, it allows us to calculate the normalized transfer time (and thus throughput) given arbitrary amounts for S, ConTh and ConPr, at any past, present or forecasted time step. For advanced analytics, based on the probability distribution representing job assignment to various clusters of worker nodes and the inverse transform for neutralization of the initial normalization, concrete transfer times can be estimated. However, such estimation will only make sense for bags of jobs. In this case the assignment of numerous jobs to various clusters of worker nodes will lend itself to a meaningful probabilistic analysis.

A. Input Features of Forecasting Model

The forecasting model individually predicts the 3 variates of the multivariate coefficient time series. When forecasting a single variate, the following features are considered as inputs to the forecasting model:
1) Lagged values of the variate itself.
2) Lagged values of the 2 other variates.

3) Slightly lagged values of the time series representing the overall traffic at the data center of interest.

Consider Fig. 8 and 9, which depict the cross-correlation functions (CCF) between the normalized coefficients a and b sampled at the data centers Australia-ATLAS and BNL-ATLAS.

Fig. 7. BNL: normalized coefficients.
Fig. 8. Australia: CCF between normalized coefficients a and b.
Fig. 9. BNL: CCF between normalized coefficients a and b.

It is evident from the CCF plots that the values are highly symmetric around the origin. Thus, we do not use the other variates as input features to predict a given variate, since such input features would not bring any additional valuable information for model construction. Without a posteriori knowledge, a plausible hypothesis is that the time series of the overall network load at the data center of interest leads the amplitude of the coefficient time series with a positive cross-correlation at small negative lags (i.e. if the amount of overall traffic increases, the time to transfer an additional unit will increase shortly after). From the logs recorded by the distributed data management system Rucio, we have mined 4 different traffic time series for the period of interest at the data center Australia-ATLAS: (1) intra-data center traffic in bytes; (2) inter-data center traffic in bytes; (3) the sum of intra- and inter-data center traffic in bytes; (4) the number of intra-data center file accesses. It can be seen from the cross-correlation plots in Fig. 10 that none of the 4 mentioned traffic time series significantly leads the normalized coefficient a at small lags with positive cross-correlation. Due to the symmetric CCF between the coefficients a and b, this result is also present in the analysis of the CCF between the mentioned traffic categories and the normalized coefficient b. The same analysis delivers no meaningful results at the BNL-ATLAS data center. Thus, the overall traffic time series is not used as an input feature for the prediction of the coefficients.

Fig. 10. Australia: CCF between traffic time series and normalized coefficient a.
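A minimal sketch of such a cross-correlation check is given below, assuming the two normalized coefficient series are available as equal-length NumPy arrays. The function is a simple sample estimate written for illustration; it is not the tool used to produce the figures.

```python
# At lag k the function correlates a[t] with b[t - k], so a peak at negative k
# would indicate that b leads a under this convention.
import numpy as np

def cross_correlation(a, b, max_lag=48):
    """Simple sample CCF estimate of a and b for lags -max_lag..max_lag."""
    a = (np.asarray(a, dtype=float) - np.mean(a)) / np.std(a)
    b = (np.asarray(b, dtype=float) - np.mean(b)) / np.std(b)
    n = len(a)
    ccf = {}
    for k in range(-max_lag, max_lag + 1):
        x = a[max(0, k):n + min(0, k)]
        y = b[max(0, -k):n - max(0, k)]
        ccf[k] = float(np.mean(x * y))
    return ccf

# Symmetry around the origin can be checked by comparing ccf[k] and ccf[-k]
# for small k.
```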

Finally, only the lagged values of the variate of interest will be used as inputs to predict its future values. As previously mentioned, we have not sampled authentic data for the derivation of the coefficient c. We assume that in the demonstrated scenarios the CCF between c and the other variates exhibits an analogous symmetric behavior.

B. Benchmark of Forecasting Methods

The time series of the normalized coefficient a at the Australia-ATLAS (Fig. 5) and BNL-ATLAS (Fig. 7) data centers are used for demonstrational purposes in this paper. Due to the shown symmetric CCF between the coefficients a and b, it is known that the coefficient b exhibits an analogous behavior; we assume that so does the coefficient c. The previously presented time series for the data centers Australia-ATLAS and BNL-ATLAS are each split into 16 datasets, each spanning a period of 2 drawing days. 80% of each dataset is used for the construction of the model, and the remaining 20% is used for its evaluation, with the normalized root mean squared error (NRMSE) as the metric of choice. Thus, we are forecasting the series for the next 9.5 hours. The following methods are benchmarked in the context of coefficient time series forecasting: (1) Auto-Regressive Integrated Moving Average (ARIMA) [15]; (2) Long Short-Term Memory Artificial Neural Networks (LSTM ANN) [16]. ARMA is a statistical state-of-the-art approach for deriving a mathematical model that describes the time series of interest in terms of its past values (AR part) and errors (MA part). ARMA models work with stationary time series. ARIMA, the integrated version of ARMA, can handle time series which become stationary after a certain degree of differencing. The model parameters are identified based on the analysis of the (partial) auto-correlation function and stationarity tests. Consider Fig. 11, which depicts the auto-correlation functions of the normalized coefficients a and b at both data centers. It is evident from the figure that all the time series have a seasonal component with a period of 48 points, which is equivalent to a day.

Fig. 11. ACF of normalized coefficients a and b.

To fit ARIMA models we use the auto.arima() function from R [17], which estimates the model parameters based on a number of statistical tests. When passing a time series object to the function, we indicate the discovered seasonality. Long Short-Term Memory neural networks are a kind of recurrent neural network. LSTM networks are optimized for learning significant information at arbitrarily extended lags. They have been used for time series forecasting by researchers in different domains, e.g. for traffic speed prediction [18] and the detection of faults in industrial data [19]. It has also been demonstrated at the recent time series forecasting competition CIF 2016 [20], hosted by IEEE, that LSTM delivers outstanding results, outperforming common statistical and soft computing methods. In our work we use the Keras API [21] to construct the neural network instances. The input to the network is one-dimensional, accepting a single time step from the time series. The network has a single hidden layer consisting of an LSTM cell. The output layer has a size of 19. Thus, the model is trained to do a one-shot 19-step-ahead forecast given a single input observation. The training lasts 1000 epochs. The hyperparameters are chosen based on initial tuning. Before training, first-order differencing and then scaling to the interval [-1;1] are applied to the data.
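A minimal sketch of the LSTM forecaster described above is given below. The architecture (single LSTM hidden layer, 19-unit output, 1000 training epochs, first-order differencing and scaling to [-1;1]) follows the text, while the number of LSTM units, the batch size and the use of the TensorFlow Keras API in Python are illustrative assumptions not specified in the paper.

```python
# `series` is assumed to be a 1-D NumPy array holding one normalized
# coefficient series.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

HORIZON = 19  # one-shot 19-step-ahead forecast

def make_supervised(series, horizon=HORIZON):
    """First-order differencing, scaling to [-1, 1], and pairing each single
    time step with the following `horizon` values."""
    diffed = np.diff(np.asarray(series, dtype=float))
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaled = scaler.fit_transform(diffed.reshape(-1, 1)).ravel()
    x, y = [], []
    for i in range(len(scaled) - horizon):
        x.append(scaled[i])
        y.append(scaled[i + 1:i + 1 + horizon])
    x = np.array(x).reshape(-1, 1, 1)  # (samples, timesteps=1, features=1)
    return x, np.array(y), scaler

def build_model(units=16):
    model = Sequential([
        LSTM(units, input_shape=(1, 1)),  # single hidden layer with an LSTM cell
        Dense(HORIZON),                   # one-shot multi-step output
    ])
    model.compile(loss="mse", optimizer="adam")
    return model

# x, y, scaler = make_supervised(series)
# model = build_model()
# model.fit(x, y, epochs=1000, batch_size=1, verbose=0)
```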
In comparison to ARIMA, LSTM has the following benefits: LSTM allows framing the learning problem in a flexible manner; e.g. the specified shape of the output layer tailors the training procedure towards delivering forecasts of the specific length of interest. ARIMA requires a thorough statistical analysis and treatment of the underlying time series (e.g. in terms of stationarity, Gaussian distribution of residuals, etc.); LSTM does not explicitly pose such restrictions. Conversely, ARIMA models are computationally cheaper to construct. Given the input time series, the auto.arima() function returns a deterministically computed ARIMA model. However, due to the random initialization of weights in the LSTM model, a trained instance may differ from run to run. Thus, in order to compare the performance of both approaches without setting a random seed, a new LSTM model was trained 50 times on each dataset from the Australia-ATLAS and BNL-ATLAS data centers. For each of the datasets the corresponding LSTM training runs are considered, and the log changes [22] between the ARIMA and LSTM NRMSE (holding the ARIMA NRMSE as the reference) are computed, resulting in a distribution. Our findings show that, given a dataset, one approach will systematically outperform the other. In rare cases (for 5 out of 32 datasets), both methods perform equally well on average. Consider Fig. 12 and 13, which show concrete examples of LSTM outperforming ARIMA and vice versa.

For these 2 datasets, the distributions of the log NRMSE changes are depicted in Fig. 14.

Fig. 12. LSTM outperforming ARIMA(1,0,0)(1,0,0)[48].
Fig. 13. ARIMA(0,1,1) outperforming LSTM.
Fig. 14. Distributions of log NRMSE changes for 2 datasets.

We have performed the augmented Dickey-Fuller (ADF) test, implying stationarity under the alternative hypothesis, on each dataset (on the fraction used for model construction). We have then tested for an association between the ADF test statistic and the type of forecasting method which dominates in terms of performance on the given dataset. Unfortunately, this analysis did not deliver significant results. Thus, the method which is more accurate on average needs to be selected. Fig. 15 shows boxplots of all measured log NRMSE changes for the two separate data centers. The average log NRMSE changes are 0.51 and 0.57 for the Australia-ATLAS and BNL-ATLAS data centers respectively. Thus, on average, ARIMA outperforms LSTM. Based on these findings, we adopt the ARIMA method for our throughput forecasting model. The boxplots of the ARIMA NRMSE for all datasets are shown in Fig. 16.

Fig. 15. Log NRMSE changes.
Fig. 16. Performance of ARIMA in terms of NRMSE.
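One step of the benchmark can be sketched as follows, under stated assumptions: the paper uses R's auto.arima(), for which pmdarima's auto_arima serves here as a Python analogue, and the NRMSE normalization and log-change formula are plausible readings rather than quotations from the text.

```python
# `train` and `test` are hypothetical 80%/20% splits of one 2-day dataset of a
# normalized coefficient series (48 points per drawing day, so the test part
# covers roughly the next 9.5 hours).
import numpy as np
import pmdarima as pm

def nrmse(actual, predicted):
    """RMSE normalized by the range of the actual values (one common convention;
    the exact normalization is not specified in the text)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return rmse / (actual.max() - actual.min())

def arima_forecast(train, horizon):
    """Fit a seasonal ARIMA automatically; m=48 encodes the daily seasonality."""
    model = pm.auto_arima(train, seasonal=True, m=48, suppress_warnings=True)
    return model.predict(n_periods=horizon)

# Usage (with a forecast from a trained LSTM, e.g. the sketch above):
# nrmse_arima = nrmse(test, arima_forecast(train, len(test)))
# nrmse_lstm = nrmse(test, lstm_forecast)
# log_change = np.log(nrmse_lstm / nrmse_arima)  # ARIMA NRMSE as reference [22]
```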

V. CONCLUSIONS AND FUTURE WORK

In this paper we have demonstrated that remote data access throughput can be framed as a multiple linear regression. Furthermore, we have shown how the estimates for the regression coefficients can be mined in the form of time series from logs provided by a remote data access monitoring system. We have identified useful input features for the forecasting model. Based on the ARIMA method, it is possible to do long-term forecasts for the coefficient time series with state-of-the-art accuracy. A benchmark of LSTM ANN has partially shown good results, but did not outperform ARIMA. The throughput forecasting model was constructed and validated using performance metrics from ATLAS data centers in the World-Wide LHC Computing Grid. Currently we are focusing on the following aspects for future work:
1) A heuristic model for the selection of the data access profile in computational jobs. Given a bag of jobs, the model decides which input data should be accessed remotely and which in the data placement fashion, with the goal of minimizing the total transfer time. Since the model handles bags of jobs, it also deals with bottlenecks on the side of the storage elements and network links.
2) A framework for remote data access management and monitoring which is interoperable with the overall grid environment.

ACKNOWLEDGMENT

This work was performed on behalf of the ATLAS Collaboration.

REFERENCES

[1] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure. Elsevier.
[2] I. Bird, "Computing for the Large Hadron Collider," Annual Review of Nuclear and Particle Science, vol. 61.
[3] ATLAS Collaboration, "The ATLAS Experiment at the CERN Large Hadron Collider," 2008 JINST 3 S.
[4] A. Chervenak, E. Deelman, M. Livny, M.-H. Su, R. Schuler, S. Bharathi, G. Mehta, and K. Vahi, "Data Placement for Scientific Applications in Distributed Environments," in Proceedings of the 8th IEEE/ACM International Conference on Grid Computing. IEEE Computer Society, 2007.
[5] V. Garonne, R. Vigne, G. Stewart, M. Barisits, M. Lassnig, C. Serfon, L. Goossens, and A. Nairz, on behalf of the ATLAS Collaboration, "Rucio - The next generation of large scale distributed system for ATLAS Data Management," in Journal of Physics: Conference Series, vol. 513, no. 4. IOP Publishing, 2014.
[6] T. Maeno, on behalf of the ATLAS Collaboration, "PanDA: Distributed Production and Distributed Analysis System for ATLAS," in Journal of Physics: Conference Series, vol. 119, no. 6. IOP Publishing, 2008.
[7] T. Kosar and M. Livny, "Stork: Making Data Placement a First Class Citizen in the Grid," in 24th International Conference on Distributed Computing Systems, 2004.
[8] J. Frey, "Condor DAGMan: Handling Inter-Job Dependencies," University of Wisconsin, Dept. of Computer Science, Tech. Rep.
[9] J. Bresnahan, M. Link, G. Khanna, Z. Imani, R. Kettimuthu, and I. Foster, "Globus GridFTP: what's new in 2007," in Proceedings of the First International Conference on Networks for Grid Applications. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2007, p. 19.
[10] J. Elmsheuser, R. Walker, C. Serfon, V. Garonne, S. Blunier, V. Lavorini, and P. Nilsson, on behalf of the ATLAS Collaboration, "New data access with HTTP/WebDAV in the ATLAS experiment," in Journal of Physics: Conference Series, vol. 664, no. 4. IOP Publishing, 2015.
[11] G. Bernabeu, F. Martinez, E. Acción, A. Bria, M. Caubet, M. Delfino, and X. Espinal, "Experiences with http/webdav protocols for data access in high throughput computing," in Journal of Physics: Conference Series, vol. 331, no. 7. IOP Publishing, 2011.
[12] F. Furano, A. Devresse, O. Keeble, M. Hellmich, and A. Á. Ayllón, "Towards an HTTP Ecosystem for HEP Data Access," in Journal of Physics: Conference Series, vol. 513, no. 3. IOP Publishing, 2014.
[13] M. Faerman, A. Su, R. Wolski, and F. Berman, "Adaptive Performance Prediction for Distributed Data-Intensive Applications," in Supercomputing, ACM/IEEE 1999 Conference. IEEE, 1999.
[14] Y. Goland, E. Whitehead, A. Faizi, S. Carter, and D. Jensen, "HTTP Extensions for Distributed Authoring - WEBDAV," Tech. Rep.
[15] R. Shumway and D. Stoffer, Time Series Analysis and Its Applications: With R Examples, ser. Springer Texts in Statistics. Springer New York.
[16] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8.
[17] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2008. [Online].
[18] X. Ma, Z. Tao, Y. Wang, H. Yu, and Y. Wang, "Long short-term memory neural network for traffic speed prediction using remote microwave sensor data," Transportation Research Part C: Emerging Technologies, vol. 54.
[19] P. Filonov, A. Lavrentyev, and A. Vorontsov, "Multivariate Industrial Time Series with Cyber-Attack Simulation: Fault Detection Using an LSTM-based Predictive Data Model," arXiv preprint.
[20] M. Štěpnička and M. Burda, "On the Results and Observations of the Time Series Forecasting Competition CIF 2016," in 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), July 2017.
[21] F. Chollet et al., Keras.
[22] L. Törnqvist, P. Vartia, and Y. O. Vartia, "How Should Relative Changes be Measured?" The American Statistician, vol. 39, no. 1, 1985.


More information

ATLAS NOTE. December 4, ATLAS offline reconstruction timing improvements for run-2. The ATLAS Collaboration. Abstract

ATLAS NOTE. December 4, ATLAS offline reconstruction timing improvements for run-2. The ATLAS Collaboration. Abstract ATLAS NOTE December 4, 2014 ATLAS offline reconstruction timing improvements for run-2 The ATLAS Collaboration Abstract ATL-SOFT-PUB-2014-004 04/12/2014 From 2013 to 2014 the LHC underwent an upgrade to

More information

The ATLAS Trigger Simulation with Legacy Software

The ATLAS Trigger Simulation with Legacy Software The ATLAS Trigger Simulation with Legacy Software Carin Bernius SLAC National Accelerator Laboratory, Menlo Park, California E-mail: Catrin.Bernius@cern.ch Gorm Galster The Niels Bohr Institute, University

More information

Data Management for Distributed Scientific Collaborations Using a Rule Engine

Data Management for Distributed Scientific Collaborations Using a Rule Engine Data Management for Distributed Scientific Collaborations Using a Rule Engine Sara Alspaugh Department of Computer Science University of Virginia alspaugh@virginia.edu Ann Chervenak Information Sciences

More information

Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster

Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster. Overview Both the industry and academia have an increase demand for good policies and mechanisms to

More information

A Data-Aware Resource Broker for Data Grids

A Data-Aware Resource Broker for Data Grids A Data-Aware Resource Broker for Data Grids Huy Le, Paul Coddington, and Andrew L. Wendelborn School of Computer Science, University of Adelaide Adelaide, SA 5005, Australia {paulc,andrew}@cs.adelaide.edu.au

More information

Online data storage service strategy for the CERN computer Centre G. Cancio, D. Duellmann, M. Lamanna, A. Pace CERN, Geneva, Switzerland

Online data storage service strategy for the CERN computer Centre G. Cancio, D. Duellmann, M. Lamanna, A. Pace CERN, Geneva, Switzerland Online data storage service strategy for the CERN computer Centre G. Cancio, D. Duellmann, M. Lamanna, A. Pace CERN, Geneva, Switzerland Abstract. The Data and Storage Services group at CERN is conducting

More information

Using S3 cloud storage with ROOT and CvmFS

Using S3 cloud storage with ROOT and CvmFS Journal of Physics: Conference Series PAPER OPEN ACCESS Using S cloud storage with ROOT and CvmFS To cite this article: María Arsuaga-Ríos et al 05 J. Phys.: Conf. Ser. 66 000 View the article online for

More information

Performance Analysis of Applying Replica Selection Technology for Data Grid Environments*

Performance Analysis of Applying Replica Selection Technology for Data Grid Environments* Performance Analysis of Applying Replica Selection Technology for Data Grid Environments* Chao-Tung Yang 1,, Chun-Hsiang Chen 1, Kuan-Ching Li 2, and Ching-Hsien Hsu 3 1 High-Performance Computing Laboratory,

More information

Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data

Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data 46 Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data

More information

CernVM-FS beyond LHC computing

CernVM-FS beyond LHC computing CernVM-FS beyond LHC computing C Condurache, I Collier STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot, OX11 0QX, UK E-mail: catalin.condurache@stfc.ac.uk Abstract. In the last three years

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

The glite File Transfer Service

The glite File Transfer Service The glite File Transfer Service Peter Kunszt Paolo Badino Ricardo Brito da Rocha James Casey Ákos Frohner Gavin McCance CERN, IT Department 1211 Geneva 23, Switzerland Abstract Transferring data reliably

More information

Operating the Distributed NDGF Tier-1

Operating the Distributed NDGF Tier-1 Operating the Distributed NDGF Tier-1 Michael Grønager Technical Coordinator, NDGF International Symposium on Grid Computing 08 Taipei, April 10th 2008 Talk Outline What is NDGF? Why a distributed Tier-1?

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

A Compact Computing Environment For A Windows PC Cluster Towards Seamless Molecular Dynamics Simulations

A Compact Computing Environment For A Windows PC Cluster Towards Seamless Molecular Dynamics Simulations A Compact Computing Environment For A Windows PC Cluster Towards Seamless Molecular Dynamics Simulations Yuichi Tsujita Abstract A Windows PC cluster is focused for its high availabilities and fruitful

More information

Performance of popular open source databases for HEP related computing problems

Performance of popular open source databases for HEP related computing problems Journal of Physics: Conference Series OPEN ACCESS Performance of popular open source databases for HEP related computing problems To cite this article: D Kovalskyi et al 2014 J. Phys.: Conf. Ser. 513 042027

More information

Report. Middleware Proxy: A Request-Driven Messaging Broker For High Volume Data Distribution

Report. Middleware Proxy: A Request-Driven Messaging Broker For High Volume Data Distribution CERN-ACC-2013-0237 Wojciech.Sliwinski@cern.ch Report Middleware Proxy: A Request-Driven Messaging Broker For High Volume Data Distribution W. Sliwinski, I. Yastrebov, A. Dworak CERN, Geneva, Switzerland

More information

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications New Optimal Load Allocation for Scheduling Divisible Data Grid Applications M. Othman, M. Abdullah, H. Ibrahim, and S. Subramaniam Department of Communication Technology and Network, University Putra Malaysia,

More information

ANSE: Advanced Network Services for [LHC] Experiments

ANSE: Advanced Network Services for [LHC] Experiments ANSE: Advanced Network Services for [LHC] Experiments Artur Barczyk California Institute of Technology Joint Techs 2013 Honolulu, January 16, 2013 Introduction ANSE is a project funded by NSF s CC-NIE

More information

LHCb Distributed Conditions Database

LHCb Distributed Conditions Database LHCb Distributed Conditions Database Marco Clemencic E-mail: marco.clemencic@cern.ch Abstract. The LHCb Conditions Database project provides the necessary tools to handle non-event time-varying data. The

More information

Scientific data management

Scientific data management Scientific data management Storage and data management components Application database Certificate Certificate Authorised users directory Certificate Certificate Researcher Certificate Policies Information

More information

Recurrent Neural Network Models for improved (Pseudo) Random Number Generation in computer security applications

Recurrent Neural Network Models for improved (Pseudo) Random Number Generation in computer security applications Recurrent Neural Network Models for improved (Pseudo) Random Number Generation in computer security applications D.A. Karras 1 and V. Zorkadis 2 1 University of Piraeus, Dept. of Business Administration,

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

A Capacity Planning Methodology for Distributed E-Commerce Applications

A Capacity Planning Methodology for Distributed E-Commerce Applications A Capacity Planning Methodology for Distributed E-Commerce Applications I. Introduction Most of today s e-commerce environments are based on distributed, multi-tiered, component-based architectures. The

More information

Modeling VM Performance Interference with Fuzzy MIMO Model

Modeling VM Performance Interference with Fuzzy MIMO Model Modeling VM Performance Interference with Fuzzy MIMO Model ABSTRACT Virtual machines (VM) can be a powerful platform for multiplexing resources for applications workloads on demand in datacenters and cloud

More information

CouchDB-based system for data management in a Grid environment Implementation and Experience

CouchDB-based system for data management in a Grid environment Implementation and Experience CouchDB-based system for data management in a Grid environment Implementation and Experience Hassen Riahi IT/SDC, CERN Outline Context Problematic and strategy System architecture Integration and deployment

More information

DiPerF: automated DIstributed PERformance testing Framework

DiPerF: automated DIstributed PERformance testing Framework DiPerF: automated DIstributed PERformance testing Framework Ioan Raicu, Catalin Dumitrescu, Matei Ripeanu, Ian Foster Distributed Systems Laboratory Computer Science Department University of Chicago Introduction

More information

An Experiment in Visual Clustering Using Star Glyph Displays

An Experiment in Visual Clustering Using Star Glyph Displays An Experiment in Visual Clustering Using Star Glyph Displays by Hanna Kazhamiaka A Research Paper presented to the University of Waterloo in partial fulfillment of the requirements for the degree of Master

More information

Streamlining CASTOR to manage the LHC data torrent

Streamlining CASTOR to manage the LHC data torrent Streamlining CASTOR to manage the LHC data torrent G. Lo Presti, X. Espinal Curull, E. Cano, B. Fiorini, A. Ieri, S. Murray, S. Ponce and E. Sindrilaru CERN, 1211 Geneva 23, Switzerland E-mail: giuseppe.lopresti@cern.ch

More information