Department of Physics & Astronomy
Experimental Particle Physics Group
Kelvin Building, University of Glasgow, Glasgow, G12 8QQ, Scotland
Telephone: +44 (0)141 339 8855   Fax: +44 (0)141 330 5881

GLAS-PPE/2004-??
29th September 2004

OPTORSIM: A SIMULATION TOOL FOR SCHEDULING AND REPLICA OPTIMISATION IN DATA GRIDS

D. G. Cameron (1), R. Carvajal-Schiaffino (2), A. P. Millar (1), C. Nicholson (1), K. Stockinger (3), F. Zini (2)

(1) University of Glasgow, Glasgow, G12 8QQ, Scotland
(2) ITC-irst, Via Sommarive 18, 38050 Povo (Trento), Italy
(3) Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA

Abstract

In large-scale grids, the replication of files to different sites and the appropriate scheduling of jobs are both important mechanisms which can reduce access latencies and give improved usage of resources such as network bandwidth, storage and computing power. In the search for an optimal scheduling and replication strategy, the grid simulator OptorSim was developed as part of the European DataGrid project. Simulations of various high energy physics (HEP) grid scenarios have been undertaken using different job scheduling and file replication algorithms, with the emphasis being on physics analysis use-cases. An economy-based strategy has been investigated as well as more traditional methods, with the economic models showing advantages for heavily loaded grids.

CHEP 2004, Interlaken, Switzerland

1 Introduction

The remit of the European DataGrid (EDG) [1] project was to develop an infrastructure that could support the intensive computational and data handling needs of widely distributed scientific communities. In a grid with limited resources, it is important to make the best use of these resources, whether computational, storage or network. It has been shown [3, 4] that data replication - the process of placing copies of files at different sites - is an important mechanism for reducing data access times, while good job scheduling is crucial to ensure effective usage of resources. A useful way of exploring different possible optimisation algorithms is by simulation, and as part of EDG, the grid simulator OptorSim was developed and used to simulate several grid scenarios such as the CMS Data Challenge 2002 testbed [2]. Although the EDG project has now finished, the work on OptorSim has continued, with emphasis on particle physics data grids such as the LHC Computing Grid (LCG). In this paper, an overview of OptorSim's design and implementation is presented, showing its usefulness as a grid simulator both in its current features and in the ease of extensibility to new scheduling and replication algorithms. This is followed by a description of some recent experiments and results.

2 Simulation Design

There are a number of elements which should be included in a grid simulation to achieve a realistic environment. These include: computing resources to which jobs can be sent; storage resources where data can be kept; a scheduler to decide where jobs should be sent; and the network which connects the sites. For a grid with automated file replication, there must also be a component to perform the replica management. It should be easy to investigate different algorithms for both scheduling and replication and to input different topologies and workloads.

2.1 Architecture

OptorSim is designed to fulfil the above requirements, with an architecture (Figure 1) based on that of the EDG data management components. In this model, computing and storage resources are represented by Computing Elements (CEs) and Storage Elements (SEs) respectively, which are organised in Grid Sites. CEs run jobs by processing data files, which are stored in the SEs. A Resource Broker (RB) controls the scheduling of jobs to Grid Sites. Each site handles its file content with a Replica Manager (RM), within which a Replica Optimiser (RO) contains the replication algorithm which drives automatic creation and deletion of replicas.

Figure 1: OptorSim Architecture (Users, Resource Broker, Grid Sites with Computing Elements, Storage Elements, Replica Manager and Replica Optimiser).

2.2 Input Parameters

A simulation is set up by means of configuration files: one which defines the grid topology and resources, one the jobs and their associated files, and one the parameters and algorithms to use. The most important parameters include: the access pattern with which the jobs access files; the submission pattern with which the users send jobs to the RB; the level and variability of non-grid traffic present; and the optimisation algorithms to use. A full description of each is in the OptorSim User Guide [5].

2.3 Optimisation Algorithms

There are two types of optimisation which may be investigated with OptorSim: the scheduling algorithms used by the RB to allocate jobs, and the replication algorithms used by the RM at each site to decide when and how to replicate.

Scheduling Algorithms. The job scheduling algorithms are based on reducing the cost needed to run a job. Those currently implemented are: Random (a site is chosen at random); Access Cost (cost is the time needed to access all the files needed for the job); Queue Size (cost is the number of jobs in the queue at that site); and Queue Access Cost (the combined access cost for every job in the queue, plus the current job). For all but the Random scheduler, the site with the lowest cost is chosen. It should be noted that this cost is unrelated to the price of files in the economic model for replication, described below.
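
To make the cost comparison concrete, the sketch below shows how a Queue Access Cost style broker could pick the lowest-cost site. It is a minimal illustration in Java (the language OptorSim is written in); the types GridSite, GridJob and AccessCostEstimator and their methods are hypothetical and are not OptorSim's actual classes.

    import java.util.List;

    // Illustrative sketch only: hypothetical types, not OptorSim's actual classes.
    interface GridJob { }

    interface GridSite {
        List<GridJob> queuedJobs();
    }

    interface AccessCostEstimator {
        // Estimated time (seconds) to access all files the job needs, from this site.
        double estimate(GridSite site, GridJob job);
    }

    class QueueAccessCostBroker {
        private final AccessCostEstimator estimator;

        QueueAccessCostBroker(AccessCostEstimator estimator) {
            this.estimator = estimator;
        }

        // Queue Access Cost: the combined access cost of every queued job plus the
        // new job; the site with the lowest total cost wins.
        GridSite chooseSite(List<GridSite> sites, GridJob job) {
            GridSite best = null;
            double bestCost = Double.POSITIVE_INFINITY;
            for (GridSite site : sites) {
                double cost = estimator.estimate(site, job);
                for (GridJob queued : site.queuedJobs()) {
                    cost += estimator.estimate(site, queued);
                }
                if (cost < bestCost) {
                    bestCost = cost;
                    best = site;
                }
            }
            return best;
        }
    }

Dropping the inner loop over queued jobs recovers the Access Cost scheduler, and replacing the estimate with the queue length recovers Queue Size, which is why the four schedulers fit naturally behind one cost interface.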
Replication Algorithms. There are three broad options for replication strategies in OptorSim. Firstly, one can choose to perform no replication. Secondly, one can use a traditional algorithm which, when presented with a file request, always tries to replicate and, if necessary, deletes existing files to do so. Algorithms in this category are the LRU (Least Recently Used), which deletes those files which have been used least recently, and the LFU (Least Frequently Used), which deletes those which have been used least frequently in the recent past. Thirdly, one can use an economic model in which sites buy and sell files using an auction mechanism, and will only delete files if they are less valuable than the new file. Details of the auction mechanism and file value prediction algorithms can be found in [6]. There are currently two versions of the economic model: the binomial economic model, where file values are predicted by ranking the files in a binomial distribution according to their popularity in the recent past, and the Zipf economic model, where a Zipf-like distribution is used instead.
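
The sketch below illustrates the LFU policy described above: a site always tries to replicate a requested file, deleting the replicas used least frequently in the recent past until there is room. The class and method names (LfuStorageElement, recordAccess, replicate) are hypothetical and do not correspond to OptorSim's Replica Optimiser API.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch only: an LFU-style replacement decision.
    class LfuStorageElement {
        private final long capacityBytes;
        private long usedBytes = 0;
        private final Map<String, Long> replicaSizes = new HashMap<>();      // stored replicas
        private final Map<String, Integer> recentAccesses = new HashMap<>(); // recent access counts

        LfuStorageElement(long capacityBytes) {
            this.capacityBytes = capacityBytes;
        }

        // Called whenever a locally stored replica is read.
        void recordAccess(String fileName) {
            recentAccesses.merge(fileName, 1, Integer::sum);
        }

        // LFU: always try to replicate the requested file, deleting the least
        // frequently used replicas until there is enough free space.
        void replicate(String fileName, long sizeBytes) {
            while (usedBytes + sizeBytes > capacityBytes && !replicaSizes.isEmpty()) {
                String victim = recentAccesses.entrySet().stream()
                        .min(Map.Entry.comparingByValue())
                        .orElseThrow()
                        .getKey();
                usedBytes -= replicaSizes.remove(victim);
                recentAccesses.remove(victim);
            }
            replicaSizes.put(fileName, sizeBytes);
            usedBytes += sizeBytes;
            recentAccesses.put(fileName, 1);
        }
    }

An LRU variant would track last-access times instead of counts, while the economic models replace the eviction rule with a comparison of predicted file values obtained from the auction mechanism of [6].

2.4 Implementation

In OptorSim, which is written in Java(TM), each CE is represented by a thread, with another thread acting as the RB. The RB sends jobs to the CEs according to the specified scheduling algorithm and the CEs process the jobs by accessing the required files, running one job at a time. In the current implementation, the number of worker nodes for each CE simply reduces the time a file takes for processing, rather than allowing jobs to run simultaneously. When a file is needed, the CE calls the getbestfile() method of the RO being used. The replication algorithm is then used to search for the best replica to use. Each scheduling and replication algorithm is implemented as a separate Resource Broker or Replica Optimiser class respectively, and the appropriate class is instantiated at run-time, making the code easily extensible.

There are two time models implemented, one time-based and one event-driven, and OptorSim can be run in either mode with the same end results. In the time-based model, the simulation proceeds in real time. In the event-driven model, whenever all the CE and RB threads are inactive, the simulation time is advanced to the point when the next thread should be activated. The use of the event-driven model speeds up the running of the simulation considerably, whereas the time-based model may be desirable for demonstration or other purposes.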

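As a rough illustration of the event-driven time model, the sketch below jumps the simulation clock straight to the earliest pending wake-up whenever all threads are inactive, which is why this mode runs much faster than the time-based model. The class and its methods are hypothetical, not OptorSim's actual implementation.

    import java.util.PriorityQueue;

    // Illustrative sketch only: an event-driven simulation clock.
    class EventDrivenClock {
        private double now = 0.0;                                  // current simulation time (seconds)
        private final PriorityQueue<Double> wakeUps = new PriorityQueue<>();

        // A CE or RB thread that goes idle registers when it next needs to run.
        synchronized void scheduleWakeUp(double time) {
            wakeUps.add(time);
        }

        // Called when every CE and RB thread is inactive: instead of waiting in
        // real time, advance the clock directly to the next scheduled activation.
        synchronized double advanceToNextEvent() {
            Double next = wakeUps.poll();
            if (next != null && next > now) {
                now = next;
            }
            return now;
        }

        synchronized double now() {
            return now;
        }
    }

OptorSim can be run from the command-line or using a graphical user interface (GUI). A number of statistics are gathered as the simulation runs, including total and individual job times, number of replications, local and remote file accesses, volume of storage filled and percentage of time that CEs are active. If using the command-line, these are output at the end of the simulation in a hierarchical way for the whole grid, individual sites and site components. If the GUI is used, these can also be monitored in real time.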
3 Experimental Setup

Two grid configurations which have been simulated recently are the CMS(1) Data Challenge 2002 testbed (Figure 2) and the LCG August 2004 testbed (Figure 3). For the CMS testbed, CERN and FNAL were given SEs of 100 GB capacity and no CEs. All master files were stored at one of these sites. Every other site was given 50 GB of storage and a CE with one worker node. For the LCG testbed, resources were based on those published by the LCG Grid Deployment Board for Quarter 4 of 2004 [7], but with SE capacities reduced by a factor of 100 and the number of worker nodes per CE halved, to achieve useful results while keeping the running time of the simulation low. All master files were placed at CERN. In both cases, a set of 6 job types was run, based on a CDF (Collider Detector at Fermilab) use case [8], with a total dataset size of 97 GB.

(1) Compact Muon Solenoid, one of the experiments for the Large Hadron Collider (LHC) at CERN.

Figure 2: CMS Data Challenge 2002 grid topology (testbed sites and routers in the UK, Italy, France, Switzerland, Russia and the USA, with 45 Mbit/s and 1 Gbit/s links).

Figure 3: LCG August 2004 grid topology (Tier-0, Tier-1 and Tier-2 sites and routers, with link capacities from 33 Mbit/s to 10 Gbit/s).

    Testbed   No. of Sites   D/⟨SE⟩   ⟨WN⟩   ⟨C⟩ (Mbit/s)
    CMS       20             1.764    1      507
    LCG       65             0.238    108    463

Table 1: Comparison of the testbeds used.

In order to compare results from these testbeds, it is necessary to summarise their main characteristics. Useful metrics are: the ratio of the dataset size to the average SE size, D/⟨SE⟩; the average number of worker nodes per CE, ⟨WN⟩; and the average connectivity of a site, ⟨C⟩. The values of these metrics for the two testbeds are shown in Table 1. Some general statements can be made about these characteristics:

D/⟨SE⟩. A low value of D/⟨SE⟩ indicates that the SEs have more space than is required by the files. Little deletion will take place and one would expect the different replication algorithms to have little effect.

⟨WN⟩. A high value of ⟨WN⟩ will result in jobs being processed very quickly. If the job processing rate is higher than the submission rate, there will then be little queueing and the mean job time will be short. A low number of worker nodes could lead to the processing rate being lower than the submission rate and thus to escalating queues and job times.

⟨C⟩. A high ⟨C⟩ will result in fast file transfer times and hence fast job times. This will have a similar effect on the ratio of job processing rate to submission rate as described above for ⟨WN⟩.

Another important factor is the presence or absence of a CE at the site(s) which initially hold(s) all the files. In OptorSim, the intra-site bandwidth is assumed to be infinite, so if a file is local there are no transfer costs involved. For scheduling algorithms which consider the transfer costs, most of the jobs will therefore get sent to that site.
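
As a cross-check of Table 1, the CMS value of D/⟨SE⟩ follows from the storage assignment given above, assuming the average SE size is taken over all 20 sites (two with 100 GB SEs and eighteen with 50 GB) and using the 97 GB dataset:

    \[
      \langle SE \rangle = \frac{2 \times 100\,\mathrm{GB} + 18 \times 50\,\mathrm{GB}}{20} = 55\,\mathrm{GB},
      \qquad
      \frac{D}{\langle SE \rangle} = \frac{97\,\mathrm{GB}}{55\,\mathrm{GB}} \approx 1.76,
    \]

which agrees with the value of 1.764 quoted in the table.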

4 Results

4.1 CMS Data Challenge 2002 testbed

With the CMS testbed, three of the replication algorithms (LFU, binomial economic and Zipf-based economic) were compared for the four scheduling algorithms, with 1000 jobs submitted to the grid. The mean job times are shown in Figure 4. This shows that schedulers which consider the processing cost of jobs at a site possess a clear advantage, as the mean job time is reduced considerably for the Access Cost and Queue Access Cost schedulers. It can also be seen that the LFU replication algorithm is faster than the economic models for this number of jobs. This may be due to the low value of ⟨WN⟩; as the economic models have an overhead due to the auctioning time, there will initially be more queue build-up than with the LFU.

Figure 4: Mean job time for scheduling and replication algorithms in the CMS 2002 testbed.

A study was also made of how the replication algorithms reacted to increasing the total number of jobs (Figure 5). As the number of jobs on the grid increases, the mean job time also increases. One would expect it to decrease if the replication algorithms are effective, but with the low value of ⟨WN⟩ in this case, the job submission rate is higher than the processing rate, leading to runaway job times. However, the performance of the economic models improves in comparison to the LFU, and when 10000 jobs are run, the Zipf economic model is faster. For long-term optimisation, therefore, the economic models could be better at placing replicas where they will be needed.

Figure 5: Mean job time for increasing number of jobs in the CMS 2002 testbed. Points are displaced for clarity.

4.2 LCG August 2004 testbed

The pattern of results for the scheduling algorithms in the LCG testbed (Figure 6) is similar to that for CMS. The Access Cost and Queue Access Cost algorithms are in this case indistinguishable, however, and the mean job time for the LFU algorithm is negligibly small. This is due to the fact that in this case, CERN (which contains all the master files) has a CE. When a scheduler is considering access costs, CERN will have the lowest cost and the job will be sent there. This is also a grid where the storage resources are such that a file deletion algorithm is unnecessary, and a simple algorithm such as the LFU runs faster than the economic models, which are slowed down by the auctioning time. It would therefore be useful to repeat these experiments with a heavier workload, such that D/⟨SE⟩ is large enough to reveal the true performance of the algorithms.

Figure 6: Mean job time for scheduling and replication algorithms in the LCG August 2004 testbed.

5 Conclusions

The investigation of possible replication and scheduling algorithms is important for optimisation of resource usage and job throughput in HEP data grids, and simulation is a useful tool for this. The grid simulator OptorSim has been developed and experiments performed with various grid topologies. These show that schedulers which consider the availability of data at a site give lower job times. For heavily-loaded grids with limited resources, it has been shown that the economic models which have been developed begin to out-perform more traditional algorithms such as the LFU as the number of jobs increases, whereas for grids with an abundance of resources and a lighter load, a simpler algorithm like the LFU may be better. OptorSim has given valuable results and is easily adaptable to new scenarios and new algorithms. Future work will include continued experimentation with different site policies, job submission patterns and file sizes, in the context of a complex grid such as LCG.

Acknowledgments

This work was partially funded by the European Commission program IST-2000-25182 through the EU DataGRID Project, the ScotGrid Project and PPARC. The authors would also like to thank Jamie Ferguson for the development of the OptorSim GUI.

References

[1] The European DataGrid Project, http://www.edg.org

[2] D. Cameron, R. Carvajal-Schiaffino, P. Millar, C. Nicholson, K. Stockinger and F. Zini, Evaluating Scheduling and Replica Optimisation Strategies in OptorSim, Journal of Grid Computing (to appear)

[3] K. Ranganathan and I. Foster, Identifying Dynamic Replication Strategies for a High Performance Data Grid, Proc. of the Int. Grid Computing Workshop, Denver, Nov. 2001

[4] W. Bell, D. Cameron, L. Capozza, P. Millar, K. Stockinger and F. Zini, OptorSim - A Grid Simulator for Studying Dynamic Data Replication Strategies, Int. J. of High Performance Computing Applications 17 (2003) 4

[5] W. Bell, D. Cameron, R. Carvajal-Schiaffino, P. Millar, C. Nicholson, K. Stockinger and F. Zini, OptorSim v1.0 Installation and User Guide, February 2004

[6] W. Bell, D. Cameron, R. Carvajal-Schiaffino, P. Millar, K. Stockinger and F. Zini, Evaluation of an Economy-Based Replication Strategy for a Data Grid, Int. Workshop on Agent Based Cluster and Grid Computing, Tokyo, 2003

[7] GDB Resource Allocation and Planning, http://lcg-computing-fabric.web.cern.ch/lcg-computing-Fabric/GDB reosurce allocation planning.htm

[8] B. T. Huffman, R. McNulty, T. Shears, R. St. Denis and D. Waters, The CDF/D0 UK GridPP Project, CDF Internal Note 5858, 2002