Experiences in testing a Grid service in a production environment

EWDC 2009, 12th European Workshop on Dependable Computing
Toulouse, France, 14-15 May 2009

Experiences in testing a Grid service in a production environment

Flavia Donno, CERN, European Organization for Nuclear Research, Flavia.Donno@cern.ch
Andrea Domenici, DIIEIT, Università di Pisa, Andrea.Domenici@iet.unipi.it

The Worldwide LHC Computing Grid

[Figure: Grid sites in the world (courtesy of CERN).]

The Worldwide LHC Computing Grid (WLCG) is one of the largest Grid infrastructures dedicated to high-performance scientific computation, with more than 200 sites all over the world.

Mass Storage Systems

[Figure: A robotic tape library.]

The Grid uses heterogeneous Mass Storage Systems (MSS), based on different technologies and with different capabilities and interfaces.

Storage Elements

A Storage Element (SE) is a Grid service that provides:
- a mass storage system;
- a GridFTP service providing data transfer in and out of the SE, to and from the Grid;
- local POSIX-like input/output calls providing application access to the data on the SE;
- authentication, authorization, and audit/accounting facilities.
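For orientation, the sketch below expresses the four SE capabilities listed above as an abstract interface. It is an illustrative abstraction only; the class and method names are hypothetical and do not correspond to any real SE implementation or to the SRM specification.

```python
# Illustrative sketch: the SE capabilities above as an abstract interface.
# Method names are hypothetical, not part of any real SE implementation.
from abc import ABC, abstractmethod


class StorageElement(ABC):
    """Abstract view of the services a WLCG Storage Element offers."""

    @abstractmethod
    def gridftp_transfer(self, source_url: str, dest_url: str) -> None:
        """Move data into or out of the SE over GridFTP."""

    @abstractmethod
    def posix_open(self, local_path: str, mode: str = "r"):
        """POSIX-like access to a file on the SE for local applications."""

    @abstractmethod
    def authenticate(self, certificate: bytes) -> bool:
        """Check the user's Grid credentials (X.509 proxy certificate)."""

    @abstractmethod
    def record_access(self, user: str, path: str, action: str) -> None:
        """Audit/accounting hook for every operation performed on the SE."""
```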

The Storage Resource Manager

The Storage Resource Manager (SRM) is the common interface of the Storage Elements. The SRM specification defines many service requests:
- Space management functions allow the client to reserve, allocate, release, and manage storage spaces, their types, and their lifetimes.
- Data transfer functions get files into SRM spaces, either from the client's space or from other remote storage systems on the Grid, and retrieve them.
- Other function classes are Directory, Permission, and Discovery functions.
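As a quick reference, the sketch below groups a non-exhaustive selection of SRM v2.2 request names by function class; it is an illustrative summary drawn from the specification, not the complete interface.

```python
# Illustrative, non-exhaustive grouping of SRM v2.2 requests by function class.
SRM_REQUEST_CLASSES = {
    "space management": [
        "srmReserveSpace", "srmReleaseSpace", "srmUpdateSpace",
        "srmGetSpaceMetaData", "srmChangeSpaceForFiles",
    ],
    "data transfer": [
        "srmPrepareToPut", "srmPutDone", "srmPrepareToGet",
        "srmBringOnline", "srmCopy", "srmReleaseFiles",
        "srmAbortRequest", "srmAbortFiles",
    ],
    "directory": ["srmLs", "srmMkdir", "srmRmdir", "srmRm", "srmMv"],
    "permission": ["srmSetPermission", "srmGetPermission", "srmCheckPermission"],
    "discovery": ["srmPing", "srmGetTransferProtocols"],
}
```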

Testing for SRM compliance

The goals of testing are:
- validating the SRM interface and protocol specification for adherence to the explicit and implicit user requirements, and against inconsistency, incompleteness, or inefficiency;
- validating the SRM implementations for compliance with the specification;
- checking the SRM implementations for performance and reliability.

Difficulties arise from: the large and complex set of service requests, the informal specification, the number of different implementations, and the number of sites.

A typical SRM request: srmReserveSpace

Input parameters:
  TRetentionPolicyInfo   retentionPolicyInfo
  unsigned long          desiredSizeOfGuaranteedSpace
  string                 authorizationID
  unsigned long          desiredSizeOfTotalSpace
  int                    desiredLifetimeOfReservedSpace
  TTransferParameters    transferParameters
  ... more optional parameters

Output parameters:
  TReturnStatus          returnStatus
  string                 requestToken
  ... more optional parameters
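As a reference for the test-generation sketches further below, the listing here mirrors the request signature as a plain data structure. The field names follow the SRM v2.2 parameter names; the classes themselves are only an illustrative stand-in, not a real client binding.

```python
# Illustrative stand-in for the srmReserveSpace signature; not a real client binding.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SrmReserveSpaceRequest:
    retentionPolicyInfo: Optional[tuple] = None          # (retention policy, access latency)
    desiredSizeOfGuaranteedSpace: Optional[int] = None    # bytes
    authorizationID: Optional[str] = None
    desiredSizeOfTotalSpace: Optional[int] = None          # bytes
    desiredLifetimeOfReservedSpace: Optional[int] = None   # seconds
    transferParameters: Optional[dict] = None
    # ... the remaining optional parameters are omitted here


@dataclass
class SrmReserveSpaceResponse:
    returnStatus: str
    requestToken: Optional[str] = None
    # ... the remaining optional parameters are omitted here
```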

Space properties

The retentionPolicyInfo parameter specifies two properties of the requested space:
- Retention policy, the likelihood of file loss: REPLICA, OUTPUT, CUSTODIAL.
- Access latency, the readiness of file access: ONLINE (e.g., disk), NEARLINE (e.g., tape).

A storage class is a combination of retention policy and access latency. In the WLCG, the following storage classes are supported:
- Tape0Disk1: Replica, Online;
- Tape1Disk1: Custodial, Online;
- Tape1Disk0: Custodial, Nearline.
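The mapping above is small enough to write down directly; a minimal sketch, using only the combinations listed on the slide:

```python
# Mapping from (retention policy, access latency) to the WLCG storage classes
# listed above; a minimal sketch, the class names come from the slide itself.
STORAGE_CLASSES = {
    ("REPLICA", "ONLINE"): "Tape0Disk1",
    ("CUSTODIAL", "ONLINE"): "Tape1Disk1",
    ("CUSTODIAL", "NEARLINE"): "Tape1Disk0",
}


def storage_class(retention_policy: str, access_latency: str) -> str:
    """Return the WLCG storage class, or raise for unsupported combinations."""
    try:
        return STORAGE_CLASSES[(retention_policy, access_latency)]
    except KeyError:
        raise ValueError(
            f"({retention_policy}, {access_latency}) is not a WLCG-supported storage class"
        )
```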

A large test space

The srmReserveSpace request has nine input arguments. Some arguments range over a finite set of values; other arguments range over theoretically infinite sets of values. Equivalence partitioning enables us to reduce the number of values to consider... but we are still left with some 20000 test cases. And then we have the other 38 requests!
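The sketch below illustrates how equivalence partitioning still leads to a combinatorial test space: each argument is reduced to a handful of representative values, and every combination is a candidate test case. The representative values chosen here are hypothetical illustrations, not the partitions actually used in the campaign.

```python
# Minimal sketch of equivalence partitioning for a few srmReserveSpace arguments.
# The representative values are hypothetical illustrations only.
from itertools import product

partitions = {
    "retentionPolicyInfo": ["(REPLICA, ONLINE)", "(CUSTODIAL, ONLINE)",
                            "(CUSTODIAL, NEARLINE)", "unsupported", None],
    "desiredSizeOfGuaranteedSpace": [0, 1, 10**12, -1, None],   # zero, small, huge, invalid, absent
    "desiredSizeOfTotalSpace": [0, 1, 10**12, -1, None],
    "desiredLifetimeOfReservedSpace": [-1, 0, 3600, None],      # infinite, zero, finite, absent
    "transferParameters": ["valid", "invalid", None],
}

# Every combination of one representative value per argument is a candidate test case.
cases = list(product(*partitions.values()))
print(len(cases), "candidate test cases from", len(partitions), "arguments")
# With all nine arguments partitioned this way, the count reaches the order
# of the 20000 cases quoted above.
```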

Use-case analysis

We may shrink the test space by pruning argument values and combinations that may be ruled out based on the actual operating conditions in the WLCG. The SRM specification is very general and flexible:
- many negotiations are possible;
- much leeway for implementation- or site-dependent defaults;
- allowance for future requirements.

The full power of the SRM is currently not used by the implementations... yet they are SRM-compliant. A careful analysis of usage patterns and implementation constraints enables us to significantly reduce the size of the test space. This requires a close interaction between testers, users, and developers.

Reshaping the signature

We prune the domain of some arguments and eliminate others altogether:
- retentionPolicyInfo: only a few of the possible values are in use.
- authorizationID: unused, as in the WLCG credentials are not passed as parameters (certificates are used instead).
- transferParameters: unused, as site-dependent defaults are used.
- Other parameters (not shown) are similarly ignored.

However, we consider the validity or absence of a user certificate as an extra argument. We can then test the request with only five variable arguments, thus reducing the test space size to about 200 cases.
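A sketch of the pruned test space under WLCG operating conditions: only five variable arguments remain, and the representative values below are again hypothetical illustrations of the kind of pruning described above.

```python
# Enumerating the pruned test space: five variable arguments, a few
# representative values each. Values are hypothetical illustrations.
from itertools import product

pruned = {
    "retentionPolicyInfo": ["(REPLICA, ONLINE)", "(CUSTODIAL, ONLINE)",
                            "(CUSTODIAL, NEARLINE)", None],
    "desiredSizeOfGuaranteedSpace": [1, 10**12, None],
    "desiredSizeOfTotalSpace": [1, 10**12, None],
    "desiredLifetimeOfReservedSpace": [3600, None],
    "userCertificate": ["valid", "invalid", "absent"],   # treated as an extra argument
}

test_cases = list(product(*pruned.values()))
print(len(test_cases), "test cases")   # 216, on the order of the ~200 cases quoted
```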

Modeling constraints and conditions (1)

Cause-effect graphing is used to derive test cases covering constraints and operating conditions, e.g.:

Causes:
  1  retentionPolicyInfo is not NULL
  2  retentionPolicyInfo is supported by server
  ...
  11 requestToken is returned       [11 and 12 mutually exclusive]
  12 spaceToken is returned         [12 requires 13]
  13 sizeOfGuaranteedReservedSpace and lifetimeOfReservedSpace are returned

Effects:
  ...
  94 sizeOfGuaranteedReservedSpace = default
  95 lifetimeOfReservedSpace = default
  96 transferParameters is ignored
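The sketch below shows how a few of the constraints above can be encoded and checked mechanically against an observed request/response. The rule set is a simplified illustration, not the full cause-effect graph used in the campaign; the third rule is a hypothetical example of the same idea.

```python
# Simplified encoding of a few cause-effect constraints from the slide above.
def check_response(causes: dict, effects: dict) -> list:
    """Return the constraints violated by one observed request/response pair."""
    violations = []
    # [11 and 12 mutually exclusive]: requestToken and spaceToken cannot both be returned
    if effects.get("requestToken_returned") and effects.get("spaceToken_returned"):
        violations.append("11 and 12 are mutually exclusive")
    # [12 requires 13]: a returned spaceToken must come with size and lifetime
    if effects.get("spaceToken_returned") and not effects.get("size_and_lifetime_returned"):
        violations.append("12 requires 13")
    # Hypothetical example rule: an absent transferParameters should simply be ignored (96)
    if not causes.get("transferParameters_given") and not effects.get("transferParameters_ignored"):
        violations.append("96 expected: transferParameters is ignored")
    return violations


# Example: a response returning a space token without size/lifetime is flagged.
print(check_response(
    causes={"transferParameters_given": False},
    effects={"spaceToken_returned": True, "transferParameters_ignored": True},
))
```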

Modeling constraints and conditions (2)

[Figure: Cause-effect graph for the srmReserveSpace request, relating the causes (1-13) to the effects (90-96).]

Error guessing

Error guessing = pragmatic knowledge + formalization. Example: formalizing the behavior with state machines led to the discovery of unexpected interactions.

[Figure: Partial state machine for a file, with states such as SURL_Assigned, Busy, Nearline, Online, and Readable, and transitions triggered by requests such as PrepareToPut, PutDone, BringOnline, PrepareToGet, ReleaseFiles, ChangeSpaceForFiles, AbortFiles, and AbortRequest, some guarded by the retention policy (e.g., [retention = CUSTODIAL]).]
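A simplified sketch of the kind of state-machine model used for error guessing: replaying a request sequence against an explicit transition table makes interactions that the model does not cover stand out. The states and triggers are drawn from the partial state machine above, but the transition table itself is illustrative, not the full SRM file semantics.

```python
# Simplified file state machine for error guessing; the transition table is
# illustrative, not the full SRM semantics.
TRANSITIONS = {
    # (state, request) -> next state
    ("SURL_Assigned", "PrepareToPut"): "Busy",
    ("Busy", "PutDone"): "Online",
    ("Busy", "AbortRequest"): "SURL_Assigned",
    ("Online", "PrepareToGet"): "Readable",
    ("Readable", "ReleaseFiles"): "Online",
    ("Online", "ChangeSpaceForFiles"): "Nearline",
    ("Nearline", "BringOnline"): "Online",
}


def run(sequence, state="SURL_Assigned"):
    """Replay a request sequence and report transitions not covered by the model."""
    for request in sequence:
        if (state, request) not in TRANSITIONS:
            print(f"unexpected interaction: {request} while file is {state}")
            continue
        state = TRANSITIONS[(state, request)]
    return state


# Example: ChangeSpaceForFiles issued while the file is still Busy is flagged.
run(["PrepareToPut", "ChangeSpaceForFiles", "PutDone"])
```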

Test case families

Five families of test cases have been designed:
- Availability: check the availability in time of the SRM service end-points.
- Basic: verify the basic functionality of the implemented SRM APIs.
- Use Cases: check boundary conditions, use cases derived from real usage, function interactions, exceptions, etc.
- Exhaustion: check extreme values and properties of input and output arguments, such as length of filenames, URL format, etc.
- Stress: stress the systems, identify race conditions, study the behavior of the system when critical concurrent operations are performed, etc.

The SRM testbed

The following SRM implementations are being tested:
- CASTOR, developed at CERN, uses tape libraries with disk servers as front-end caches. SRM 2.2 implementation developed at RAL (UK). (4 Tier-1 sites)
- dCache, developed at DESY (Germany), uses multiple MSS back-ends, both custom and proprietary. SRM 2.2 implementation developed at FNAL (USA). (7 Tier-1 sites)
- DPM, developed at CERN, a disk-only MSS. SRM 2.2 implementation developed at CERN. (6 Tier-2 sites)
- DRM/BeStMan is the LBNL (USA) disk-based storage system. LBNL was the first promoter of SRM, and this storage system was the first prototype on which SRM was tested. (1 Tier-2 site)
- StoRM, developed at CNAF (Italy), uses parallel file systems such as GPFS or PVFS. (4 Tier-2 sites)

Test execution and analysis

The execution framework is based on S2 and shell scripts, which:
- invoke SRM requests;
- make checks on return codes;
- define complex test actions.

Tests are executed automatically, with result logging, six times a day. Monthly plots are produced for each test family.
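The sketch below is a schematic stand-in for this harness, not the actual S2 + shell-script framework: it runs one script per test case, treats the exit code as pass/fail, and computes the failure ratio per family that the monthly plots are built from. The script paths, names, and endpoint URL are hypothetical.

```python
# Schematic stand-in for the S2 + shell-script harness; paths and names are hypothetical.
import subprocess
import time
from collections import defaultdict

# Hypothetical layout: one executable test script per (family, test case).
TESTS = {
    "basic": ["tests/basic/reserve_space.sh", "tests/basic/prepare_to_put.sh"],
    "usecase": ["tests/usecase/put_then_change_space.sh"],
}


def run_once(endpoint: str) -> dict:
    """Run all tests against one SRM endpoint and return the failure ratio per family."""
    stats = defaultdict(lambda: [0, 0])   # family -> [failures, tests]
    for family, scripts in TESTS.items():
        for script in scripts:
            result = subprocess.run([script, endpoint], capture_output=True)
            stats[family][1] += 1
            if result.returncode != 0:     # a non-zero exit code counts as a failure
                stats[family][0] += 1
    return {f: failures / total for f, (failures, total) in stats.items()}


# In the real campaign this runs six times a day per endpoint, and the logged
# ratios feed the monthly plots shown on the next slides.
if __name__ == "__main__":
    ratios = run_once("httpg://srm.example.org:8443/srm/managerv2")
    print(time.strftime("%Y-%m-%d %H:%M"), ratios)
```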

Pre-production testing

[Plot: Basic tests, January 2006 - March 2007: number of failures / number of tests vs. date for the CASTORCERN, DCACHEFNAL, DPMCERN, DRMLBNL, and STORM endpoints.]

In-production testing

[Plot: Deployment basic tests, May 2008 - February 2009: number of failures / number of tests vs. date for the production endpoints, including CASTOR instances (ALICE, ASGC, ATLAS, CERN, CMS, CNAF, DTEAM, LHCB, PPS), dCache instances (BNL, DESY, FZK, IN2P3, NDGF, PIC, SARA, TRIUMF, UCSD, stress), DPM instances (CERN, LAL, NIKHEF), and DRMLBNL.]

Conclusions

A complex Grid service such as the SRM poses a challenge to testers. Standard testing techniques are fundamental... but cannot be applied mechanically. Testers, users, and developers cannot live on different planets. The development of a (semi)formal model has helped design a few families of tests. The testing campaign itself has motivated the developers to reconsider many of the initial assumptions and decisions, leading to solutions that seem to better satisfy the needs of the users.

Thank you / Merci