Experiences in testing a Grid service in a production environment

EWDC 2009 12th European Workshop on Dependable Computing Toulouse, France, 14 15 May, 2009 Experiences in testing a Grid service in a production environment Flavia Donno CERN, European Organization for Nuclear Research Flavia.Donno@cern.ch Andrea Domenici DIIEIT, Università di Pisa Andrea.Domenici@iet.unipi.it

The Worldwide LHC Computing Grid Grid sites in the world (Courtesy of CERN). The Worldwide LHC Computing Grid (WLCG) is one of the largest Grid infrastructures dedicated to high-performance scientific computation, with more than 200 sites all over the world. F. Donno, A. Domenici EWDC09 2/21

Mass Storage Systems A robotic tape library. The Grid uses heterogenous Mass Storage Systems (MSS), based on different technologies and with different capabilities and interfaces. F. Donno, A. Domenici EWDC09 3/21

Storage Elements A Storage Element (SE) is a Grid Service that provides: A mass storage system. A GridFTP service to provide data transfer in and out of the SE to and from the Grid. Local POSIX-like input/output calls providing application access to the data on the SE. Authentication, authorization and audit/accounting facilities. F. Donno, A. Domenici EWDC09 4/21

The Storage Resource Manager The Storage Resource Manager (SRM) is the common interface of the Storage Elements. The SRM specification defines many service requests: Space management functions allow the client to reserve, allocate, release, and manage storage spaces, their types and lifetimes. Data transfer functions have the purpose of getting files into SRM spaces either from the client s space or from other remote storage systems on the Grid, and to retrieve them. Other function classes are Directory, Permission, and Discovery functions. F. Donno, A. Domenici EWDC09 5/21

Testing for SRM compliance The goals of testing are: Validating the SRM interface and protocol specification for adherence to the explicit and implicit user requirements, and against inconsistency, incompleteness, or inefficiency; validating the SRM implementations for compliance with the specification; checking the SRM implementations for performance and reliability. Difficulties arise from: Large and complex set of service requests, informal specification, number of different implementations, and number of sites. F. Donno, A. Domenici EWDC09 6/21

A typical SRM request srmreservespace Input parameters: TRetentionPolicyInfo unsigned long string unsigned long int retentionpolicyinfo desiredsizeofguaranteedspace authorizationid desiredsizeoftotalspace desiredlifetimeofreservedspace TTransferParameters transferparameters... more optional parameters Output parameters: TReturnStatus returnstatus string requesttoken... more optional parameters F. Donno, A. Domenici EWDC09 7/21

Space properties The retentionpolicyinfo parameter specifies two properties of the requested space: Retention policy, likelyhood of file loss: REPLICA, OUTPUT, CUSTODIAL. Access latency, readiness of file access: ONLINE (e.g., disk), NEARLINE (e.g., tape). A storage class is a combination of retention policy and access latency. In the WLCG, the following storage classes are supported: Tape0Disk1: Replica, Online; Tape1Disk1: Custodial, Online; Tape1Disk0: Custodial, Nearline. F. Donno, A. Domenici EWDC09 8/21

A large test space The srmreservespace request has nine input arguments. Some arguments range over a finite set of values. Other arguments range over theoretically infinite sets of values. Equivalence partitioning enables us to reduce the number of values to consider... but we are still left with some 20000 test cases. And then we have the other 38 requests! F. Donno, A. Domenici EWDC09 9/21

Use-case analysis We may shrink the test space by pruning argument values and combinations that may be ruled out based on the actual operating conditions in the WLCG. The SRM specification is very general and flexible: many negotiations are possible; much leeway for implementation or site dependent defaults; allowance for future requirements. The full power of the SRM is currently not used by the implementations... yet they are SRM-compliant. A careful analysis of usage patterns and implementation constraints enables us to significantly reduce the size of the test space. This requires a close interaction between testers, users, and developers. F. Donno, A. Domenici EWDC09 10/21

Reshaping the signature We prune the domain of an argument and eliminate some altogether: retentionpolicyinfo: only a few of the possible values are in use. authorizationid: unused, as in the WLCG credentials are not passed as parameters (certificates are used instead). transferparameters: unused, as site dependent defaults are used. Other parameters (not shown) are similarly ignored. However, we consider the validity or absence of a user certificate as an extra argument. We can then test the request with only five variable arguments, thus reducing the test space size to about 200 cases. F. Donno, A. Domenici EWDC09 11/21

Modeling constraints and conditions (1) Cause-effect graphing is used to derive test cases covering constraints and operating conditions, e.g.: Causes: 1 retentionpolicyinfo is not NULL 2 retentionpolicyinfo is supported by server... 11 requesttoken is returned [11 and 12 mutually exclusive] 12 spacetoken is returned [12 requires 13] 13 sizeofguaranteedreservedspace and lifetimeofreservedspace are returned Effects:... 94 sizeofguaranteedreservedspace = default 95 lifetimeofreservedspace = default 96 transferparameters is ignored F. Donno, A. Domenici EWDC09 12/21

Modeling constraints and conditions (2) 1 ~ 23 90 2 14 3 4 ~ ~ 18 214 5 6 7 8 9 ~ ~ ~ ~ 94 56 95 78 ~ 110 256 278 114 91 10 11 E 12 R 13 ~ 96 112 ~ ~ Cause-effect graph for the srmreservespace request. 92 93 F. Donno, A. Domenici EWDC09 13/21

Error guessing Error guessing = pragmatic knowledge + formalization. Example: formalization of behavior by state machines led to discover unexpected interactions. SURL_Assigned Busy PutDone [retention = CUSTODIAL] PutDone [retention <> CUSTODIAL] BringOnline PrepareToPut [overwrite] ChangeSpaceForFiles ChangeSpaceForFiles Nearline BringOnline Online ReleaseFiles PrepareToGet ReleaseFiles [retention <> CUSTODIAL] AbortRequest Readable AbortFiles ChangeSpaceForFiles PrepareToGet NearlineOnline AbortFiles AbortRequest Partial state machine for a file. F. Donno, A. Domenici EWDC09 14/21

Test case families Five families of test cases have been designed: Availability to check the availability in time of the SRM service end-points. Basic to verify basic functionality of the implemented SRM APIs. Use Cases to check boundary conditions, use cases derived by real usage, function interactions, exceptions, etc. Exhaustion to check extreme values and properties of input and output arguments such as length of filenames, URL format, etc. Stress tests to stress the systems, identify race conditions, study the behavior of the system when critical concurrent operations are performed, etc. F. Donno, A. Domenici EWDC09 15/21

The SRM testbed The following SRM implementations are being tested: CASTOR developed at CERN, uses tape libraries with disk servers as front-end caches. SRM 2.2 implementation developed at RAL (UK) (4 Tier-1 sites). dcache developed at DESY (Germany), uses multiple MSS backends, both custom and proprietary. SRM 2.2 implementation developed at FNAL (USA) (7 Tier-1 sites). DPM developed at CERN, a disk-only MSS. SRM 2.2 implementation developed at CERN (6 Tier-2 sites). DRM/BeStMan is the LBNL (USA) disk-based storage system. LBNL has been the first promoter of SRM, and this storage system was the first prototype on which SRM has been tested (1 Tier-2 sites). StoRM developed at CNAF (Italy), uses parallel file systems such as GPFS or PVFS. (4 Tier-2 sites). F. Donno, A. Domenici EWDC09 16/21

Test execution and analysis Execution framework based on S2 and shell scripts. invoke SRM requests; make checks on return codes; define complex test actions. Automatic execution and result logging six times a day. Monthly plots for each test family. F. Donno, A. Domenici EWDC09 17/21

Pre-production testing Basic Tests Jan 2006 -- Mar 2007 1 CASTORCERN DCACHEFNAL DPMCERN DRMLBNL STORM 0.8 No. of failures/no. of tests 0.6 0.4 0.2 0 11/11/06 11/25/06 12/09/06 12/23/06 01/06/07 01/20/07 02/03/07 02/17/07 03/03/07 03/17/07 Date F. Donno, A. Domenici EWDC09 18/21

In-production testing Deployment Basic Tests May 2008 -- Feb 2009 No. of failures/no. of tests 1 0.8 0.6 0.4 CASTORALICE CASTORASGC CASTORATLAS CASTORCERN CASTORCMS CASTORCNAF CASTORDTEAM CASTORLHCB CASTORPPS DCACHEBNLPR DCACHEDESY DCACHEFZKPR DCACHEIN2P3PR DCACHENDGFPR DCACHEPICPR DCACHESARAPR DCACHESTRESS DCACHETRIUMFPR DCACHEUCSD DPMCERN DPMLAL DPMNIKHEF DRMLBNL 0.2 0 06/01/08 07/01/08 08/01/08 09/01/08 10/01/08 11/01/08 12/01/08 01/01/09 02/01/09 Date F. Donno, A. Domenici EWDC09 19/21

Conclusions A complex Grid service such as the SRM poses a challenge to testers. Standard testing techniques are fundamental... but cannot be applied mechanically. Testers, users, and developers cannot live on different planets. The development of a (semi)formal model has helped design a few families of tests. The testing campaign itself has motivated the developers to reconsider many of the initial assumptions and decisions, leading to solutions that seem to better satisfy the needs of the users. F. Donno, A. Domenici EWDC09 20/21

Thank you Merci F. Donno, A. Domenici EWDC09 21/21