Recent Evolutions of GridICE: a Monitoring Tool for Grid Systems


Cristina Aiftimiei (INFN-Padova, Padova, Italy; on leave from NIPNE-HH, Bucharest, Romania), cristina.aiftimiei@pd.infn.it
Sergio Andreozzi (INFN-CNAF, Bologna, Italy), sergio.andreozzi@cnaf.infn.it
Guido Cuscela (INFN-Bari), guido.cuscela@ba.infn.it
Giacinto Donvito (INFN-Bari), giacinto.donvito@ba.infn.it
Vihang Dudhalkar (INFN-Bari and Dipartimento di Fisica, Politecnico di Bari), vihang007@gmail.com
Sergio Fantinel (INFN-Padova/Legnaro, Padova, Italy), sergio.fantinel@lnl.infn.it
Enrico Fattibene (INFN-CNAF, Bologna, Italy), enrico.fattibene@cnaf.infn.it
Giorgio Maggi (INFN-Bari and Dipartimento di Fisica, Politecnico di Bari), giorgio.maggi@ba.infn.it
Giuseppe Misurelli (INFN-CNAF, Bologna, Italy), giuseppe.misurelli@cnaf.infn.it
Antonio Pierro (INFN-Bari), antonio.pierro@ba.infn.it

ABSTRACT

Grid systems must provide their users with precise and reliable information about the status and usage of the available resources. The efficient distribution of this information enables Virtual Organizations (VOs) to optimize the utilization strategies of their resources and to complete the planned computations. In this paper, we describe the recent evolution of GridICE, a monitoring tool for Grid systems. These evolutions are targeted at satisfying the requirements of the main categories of users: Grid operators, site administrators, Virtual Organization (VO) managers and Grid users.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics - performance measures

General Terms

Measurement, Performance, Management

Keywords

Grid computing, monitoring, measurement, data quality, performance analysis

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HPDC'07, June 25-29, 2007, Monterey, California, USA. Copyright 2007 ACM 978-1-59593-673-8/07/0006...$5.00.

1. INTRODUCTION

Grid computing is concerned with the virtualization, integration and management of services and resources in a distributed, heterogeneous environment that supports collections of users and resources across traditional administrative and organizational domains [24].

One aspect of particular importance is Grid monitoring, that is, the activity of measuring significant Grid resource-related parameters in order to analyze the status, usage, behavior and performance of a Grid system. Grid monitoring also helps in the detection of faulty situations, contract violations and user-defined events. Two main types of monitoring can be identified: infrastructure monitoring and application monitoring. The former aims at collecting information about Grid resources; it can also maintain the history of observations in order to perform retrospective analyses. The latter aims at enabling the observation of a particular execution of an application; the collected data can be useful for the application development activity or for visualizing its behavior when running on a machine with no login access (i.e., the typical Grid use case). In this area, GridICE [14], an open source distributed monitoring tool for Grid systems, provides full coverage of the first type.

The project was started in late 2002 within the EU-DataTAG [8] project and is evolving in the context of EU-EGEE [9] and related projects. GridICE is fully integrated with the gLite middleware [15]: its metering service and the publication of the measured data can be configured via the gLite installation mechanisms. GridICE is designed to serve different categories of monitoring information consumers, and different aggregation dimensions are provided for Grid operators, VO managers and site administrators. Recent work has been devoted to the addition of a group-level aggregation granularity based on the privilege attributes associated with a user, i.e., the groups and roles provided by the Virtual Organization Membership Service (VOMS) [2]. This capability is an important step towards supporting a correct group-based allocation of Virtual Organization resources. By means of this feature, the relevant persons can observe the VO activity as a whole or drill down through its groups/roles to the individual users.

This paper is organized as follows: Section 2 presents the GridICE architecture and its main features; Section 3 focuses on the new sensors; Section 4 discusses data quality and the related tests; finally, Section 5 draws conclusions together with directions for future work.

2. GRIDICE ARCHITECTURE AND IMPLEMENTATION

In this section, we summarize the architecture of GridICE and the relevant implementation choices (a detailed description can be found in [1, 3]). GridICE consists of three main components (see Figure 1): the sensors, performing the measurement process on the monitored entities; the site collector, aggregating the information produced by the different sensors installed within a site domain and publishing it via the Grid Information Service; and the server, performing several functions: (1) discovery of new available resources to be monitored; (2) periodic observation of the information published by the sites; (3) storage of the observed information in a relational database; (4) presentation of the collected information by means of HTML pages, XML documents and charts, which can be accessed by end users or consumed by other automated tools; (5) processing of the information in order to identify malfunctions and send the appropriate notifications to the subscribed users.

Figure 1: GridICE Architecture

A key design choice of GridICE is that all monitoring information should be distributed outside each site through the Grid Information Service interface. In gLite, this service is based on the Lightweight Directory Access Protocol (LDAP) [25] and is structured as a hierarchical set of servers with different implementations: the MDS GRIS [7] for the leaf nodes and the OpenLDAP server [21] with the Berkeley Database [5] as a back-end for the intermediate and root nodes. In the context of gLite, the latter implementation is called BDII (Berkeley Database Information Index).
Given the above design principle (monitoring information distributed via the Grid Information Service), a GridICE server is able to gather information from an up-to-date list of information sources by a two-step process: (1) a periodic query is performed on a set of root nodes (BDIIs) configured by the GridICE server administrator; the output is compared with the current list of information sources maintained by the GridICE server, new or disappeared sources are detected, and an up-to-date list is defined; (2) starting from this list of active sources of monitoring data, and according to a number of parameters set by the server administrator, a new configuration is generated for the scheduler responsible for the periodic run of the plug-ins, whose purpose is to read the data advertised by a given data source, compare it with the content of the persistent storage and update the related information. The typical frequency of the discovery process is once a day, while the frequency of the plug-ins depends on the type of service to which the monitoring information refers (with the default configuration, it varies from a minimum of 5 minutes to a maximum of 30 minutes).
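The periodic discovery in step (1) boils down to an LDAP search against the configured BDIIs, followed by a comparison with the list of already-known sources. The fragment below is only an illustrative sketch of that step, not GridICE code: the endpoint, base DN and attribute selection are assumptions based on the usual gLite/GLUE conventions, and the third-party python-ldap bindings are used for brevity.

```python
# Illustrative sketch (not GridICE code) of the periodic discovery step:
# query a BDII over LDAP and diff the advertised sources against the list
# already known to the server.
import ldap  # python-ldap bindings

BDII_URI = "ldap://bdii.example.org:2170"  # assumed top-level BDII endpoint
BASE_DN = "o=grid"                         # conventional GLUE/MDS suffix

def discover_sources(known_ids):
    """Return (new, disappeared) sets of GlueServiceUniqueID values."""
    conn = ldap.initialize(BDII_URI)
    conn.simple_bind_s()  # information systems normally allow anonymous binds
    entries = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                            "(objectClass=GlueService)",
                            ["GlueServiceUniqueID"])
    current = {attrs["GlueServiceUniqueID"][0].decode()
               for _dn, attrs in entries if "GlueServiceUniqueID" in attrs}
    return current - known_ids, known_ids - current

# A daily scheduled run would call discover_sources() and regenerate the
# scheduler configuration for the per-source plug-ins whenever the sets change.
```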

2.1 LEMON and GridICE synergies

The GridICE architecture (see Figure 1) can be considered a two-level hierarchical model: the intra-site level concerns the domain of an administrative site and aims at measuring and collecting the monitoring data in a single logical repository; the inter-site level concerns the distribution of monitoring data across sites and enables Grid-wide access to the site repositories. This is an important design choice since it enforces the concept of domain ownership, that is, it clearly defines the domain boundaries so that site administrators can govern what information is offered outside their managed domain. At the intra-site level, the transportation of data is typically performed by a fabric monitoring service, while at the inter-site level it relies on the Grid Information Service. The two levels are completely decoupled and, in principle, different fabric monitoring services can be adapted to measure and locally collect the monitoring data.

The default option proposed by GridICE is LEMON [16], a fabric monitoring tool developed at CERN. This tool is scalable and offers a rich set of sensors for fabric-related attributes such as CPU load, memory utilization and available space in the file system. Via the plug-in system, GridICE-specific sensors are configured and the related information is measured and collected. By means of a transformation adapter (fmon2glue), the data stored in the local repository is translated into the LDAP Data Interchange Format (LDIF) [13] and injected into a special instance of a GRIS [7] that acts as a site publisher (see Figure 1). The data published by this GRIS is not propagated up the Grid Information Service hierarchy, in order to avoid overloading the higher-level nodes. Instead of propagating the monitoring data, the URL (Uniform Resource Locator) of this special GRIS is advertised so that the GridICE server can discover its existence and directly pull the information.

The current GridICE release includes an old version of LEMON (v2.5.4). Recently, a substantial effort was spent to integrate a new version (v2.13.x). Thanks to this improvement, both GridICE and the local monitoring can exploit a set of new features. For instance, site administrators will be able to observe the activity of their farm by using the LEMON RRD Framework (LRF), which provides a Web-based interface to more detailed fabric-related information that is not meaningful to export at the Grid level. This upgrade required modifications of the whole set of GridICE-specific sensors in order to comply with the new naming conventions. Furthermore, some old sensors developed within GridICE were dropped in favor of new ones provided by LEMON. Finally, a significant rewriting of the fmon2glue component was required. The upgraded version of the GridICE sensors has been under testing at the INFN-BARI site since the beginning of February 2007.

3. THE NEW SENSORS

The measurements managed by GridICE are an extension of those defined in the GLUE Schema [4]. Sensors related to the attributes defined in this schema are part of the gLite middleware, and the related information is collected by querying the Grid Information Service. The extensions to the GLUE Schema concern: (1) fabric-level information; (2) individual Grid jobs; (3) aggregated information from the Local Resource Manager System (LRMS). In the near future, we plan to include new sensors concerning the gLite Workload Management Service (WMS), file transfer and file access. In Section 3.1, we describe the features of the LRMSInfo sensor, while in Section 3.2 we report on the sensor devoted to individual job monitoring. Finally, in Section 3.3, we list the improvements of the Web presentation made to accommodate the newly available measurements.

3.1 The LRMSInfo Sensor

The LRMSInfo sensor provides aggregated information on the Local Resource Manager System and was released with GridICE in September 2006. Different incarnations of this sensor exist for the relevant batch systems used in EGEE: TORQUE/Maui [23, 19] and LSF [18]. The attributes measured by this sensor are: the number of available CPUs accessible by Grid jobs (CPUs exclusively associated with queues not interfaced to the Grid are not counted); the number of used CPUs; the number of off-line CPUs; the average farm load; the total, available and used memory; and the number of running and waiting jobs (see Figure 2). This information is used on the server side to provide preliminary Service Level Agreement (SLA) monitoring support. As opposed to the information available from the gLite basic information providers, LRMSInfo provides an aggregate view that avoids double counting of resources; this is a well-known problem caused by the publication of per-queue information for resources available to many queues. The LRMSInfo sensor requires one installation per LRMS server (batch system master node) and normally runs on the Grid head node. The sensor is not intrusive, since it requires just a few seconds to complete the measurement process.

Figure 2: Information produced by the LRMSInfo sensor

As an example, we measured that on one of the Italian Tier-1 (INFN-T1) access nodes, equipped with a dual Intel(R) Xeon(TM) 3.06 GHz CPU and managing 3,700 jobs (running + queued), a typical execution time for the sensor is: real 10.603 s, user 6.880 s, sys 0.280 s. Normally, the script is scheduled every 2 minutes, but the time interval can be adjusted as preferred.
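The distinguishing point of LRMSInfo is the aggregation itself: each physical CPU is counted exactly once, even when it is reachable from several Grid-enabled queues. The sketch below only illustrates that logic on invented node and job records; the real sensor derives the raw data from the TORQUE/Maui or LSF commands.

```python
# Illustrative aggregation over hypothetical node/job records; the real
# LRMSInfo sensor obtains the raw data from the batch system itself.
from dataclasses import dataclass

@dataclass
class Node:
    cpus: int
    busy: int        # CPUs currently occupied by jobs
    offline: bool
    queues: set      # queues that can dispatch to this node

def lrms_info(nodes, grid_queues, job_states):
    """Farm-wide view with every CPU counted exactly once."""
    avail = used = offline = 0
    for node in nodes:
        if not (node.queues & grid_queues):
            continue                 # CPUs only reachable from non-Grid queues
        if node.offline:
            offline += node.cpus
        else:
            avail += node.cpus
            used += node.busy
    return {
        "cpus_available": avail,
        "cpus_used": used,
        "cpus_offline": offline,
        "jobs_running": sum(1 for s in job_states if s == "R"),
        "jobs_waiting": sum(1 for s in job_states if s == "Q"),
    }
```
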
3.2 The Job Monitoring Sensor

Grid users want to know the status of their jobs on a Grid infrastructure, while site administrators want to know who is running what and where. This need is fulfilled by the GridICE job monitoring sensor; the measured attributes are described in Table 1. Implementing this sensor presented a number of challenges. First of all, the required information is spread across many sources (e.g., log files, batch system APIs). Secondly, sites supporting large numbers of running and queued jobs raise scalability concerns, due to the need to interact with the batch system and to access a number of log files. Several invocations of system commands have to be performed for each sensor run, with the risk of affecting the performance of other services. For these reasons, the approach to the design of this sensor changed over time in order to reduce its intrusiveness in terms of resource consumption in a shared execution environment. In the latest GridICE release, this sensor is composed of two daemons and a probe to be executed periodically, all installed on the Grid head node of a farm.

Table 1: Attributes measured by the Job Monitoring sensor

  NAME         Job name
  JOB ID       Local LRMS job id
  GRID ID      Grid job unique id
  USER         Local mapped username
  VO           User VO name
  QUEUE        Queue
  QTIME        Job creation time
  START        Job start time
  END          Job end time
  STATUS       Job status
  CPUTIME      Job CPU usage time
  WALLTIME     Job wall time
  MEMORY       Job memory usage
  VMEMORY      Job virtual memory usage
  EXEC HOST    Execution host (WN)
  EXIT STATUS  Batch system exit status
  SUBJECT      User DN
  VOMS ROLES   User VOMS roles

By the current design, we are able to efficiently measure the job information also at large sites. As an example, in the last months the INFN-T1, handling thousands of jobs, was monitored without creating any load problem. This goal was achieved by a stateful strategy shared among the different executions of the sensor. The two daemons, written in the Perl and C programming languages, listen to a set of log files and collect the relevant information. For each run of the probe, this information is correlated with the output of a few LRMS command invocations and the status of all jobs is stored in a cache. Subsequent executions of the probe update the already available information only if needed (e.g., the data related to jobs staying in a queue for a long time do not cause any update besides the initial state change of entering the queue). On one of the INFN-T1 Grid access nodes, equipped with a dual Intel(R) Xeon(TM) 3.06 GHz CPU, we measured a time lower than one minute (15 s user, 15 s system) to gather the information of 500 jobs. Normally, the probe is executed every 20 minutes, but the time interval can be adjusted as preferred. As regards the daemon parsing the accounting log of the LSF batch system, an execution time of 2.5 s (1.8 s user, 0.2 s system) was measured to retrieve the needed information from a 37 MB log file related to 54,808 jobs. This test was performed on a dual Intel(R) Xeon(TM) 2.80 GHz CPU.

There are security and privacy aspects related to the job monitoring sensor, since it is able to measure the Grid identity of the user submitting a job. For security reasons, the measurement of this attribute is disabled by default and can be enabled by the site administrators at each site. The GridICE server is being evolved in order to show sensitive information only to authorized persons. The Grid-wide distribution of these data requires the evolution of the Grid Information Service in order to provide the necessary level of privacy.
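The stateful strategy described above can be reduced to a simple idea: keep the last published snapshot of every job and touch only the entries whose state has changed. A much-simplified sketch follows; the cache location and record layout are invented for illustration and are not the actual GridICE ones.

```python
# Simplified sketch of the probe's stateful update: merge the events gathered
# by the log-tailing daemons into the cached job table and return only the
# jobs whose state changed, so that unchanged jobs cost nothing.
import json
import os

CACHE_FILE = "gridice_jobs_cache.json"  # hypothetical location

def load_cache():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def update(events):
    """events: {job_id: {"STATUS": ..., "CPUTIME": ..., ...}} from the daemons."""
    cache = load_cache()
    changed = {}
    for job_id, record in events.items():
        if cache.get(job_id) != record:   # new job or state transition
            cache[job_id] = record
            changed[job_id] = record
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)
    return changed                        # only these are (re)published
```
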
VO managers and End-users can rely on GridICE in order to verify the available resources and their status before starting the submission of a huge number of jobs and, after the submission, to follow the jobs execution. Grid operators and site administrators also rely on the notification capability to drive their attention towards emerging problems. The following new features were included in GridICE server release published in September 2006: view the monitored sites part of a certain region (see Figure 3); the integration of information from the EGEE GOC database (e.g., scheduled downtime); LRMSInfo-based information and preliminary SLA support (see Figures 4, 5 and 6); new statistical plots with improved look&feel. Further work is being carried out to properly handle the VOMS information (groups and roles of users submitting

Further work is being carried out to properly handle the VOMS information (the groups and roles of the users submitting the jobs) gathered by the new version of the job monitoring sensors (not yet released). There are two main goals for this extension. The first goal is to provide reports about resource usage with the details of VOMS groups, roles and users, in contrast with the current situation where only VO-level information can be provided. This is a precise requirement of the BioinfoGRID project [6], since the BioinfoGRID users belong to the BioinfoGRID group of the biomed VO. Being able to track the user group would allow the activity of the BioinfoGRID project to be distinguished from the activity of the rest of the biomed VO. The second goal is to be able to select the information presented according to the consumer identity, its group and/or role (see Figure 7). To properly handle this use case, the GridICE server will be configured and extended in order to retrieve the user identity from the digital certificate installed in the user's browser. By using this identity, the related role (e.g., site manager, none) can be retrieved from the GOC database. Sensitive information will be shown only to those having the right credentials. For instance, in the case of a site manager, the identity of the users submitting jobs to that particular site will be shown. Clearly, the GridICE server will continue to offer unauthenticated access alongside the authenticated one.

Figure 7: Table showing their own jobs to the user identified by means of the browser certificate
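Selecting what to display according to the consumer identity essentially means reading the subject DN of the certificate that the browser presents over HTTPS and mapping it to a role. The fragment below only sketches this idea: SSL_CLIENT_S_DN is the standard Apache mod_ssl variable, while the role lookup and the job records are invented placeholders for the GOC-database and GridICE-database accesses.

```python
# Sketch of identity-based filtering. Apache/mod_ssl exposes the client
# certificate subject as SSL_CLIENT_S_DN; lookup_role() stands in for the
# GOC database query and all_jobs for the rows read from the GridICE server.
import os

SENSITIVE_FIELDS = ("SUBJECT", "VOMS ROLES")

def visible_jobs(all_jobs, lookup_role):
    dn = os.environ.get("SSL_CLIENT_S_DN")   # None if no client certificate
    role, site = lookup_role(dn) if dn else (None, None)
    if role == "site manager":
        # a site manager may see who is submitting jobs to the managed site
        return [job for job in all_jobs if job["site"] == site]
    # anonymous or unprivileged users: strip the sensitive attributes
    return [{k: v for k, v in job.items() if k not in SENSITIVE_FIELDS}
            for job in all_jobs]
```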

4. ABOUT DATA QUALITY

The quality of a Grid monitoring system depends on many factors. Sensors have to be not only non-intrusive, they also have to meet data quality requirements in terms of trustworthiness, objectiveness and ease of use. The different stages of transportation from the sources to the central server have to preserve this quality. In this section, we present the analysis performed on the data collected during a period of more than two months of operation.

For this test, we consider as the truth what is contained in the batch system log files. However, since the direct handling of batch system log files is heavy and not handy, we opted to rely on refined data extracted by means of other tools. As regards the TORQUE/Maui batch system, we created a relational database using the PBS Tools by the Ohio Supercomputer Center [22]. This tool does not consider jobs canceled before the start of their execution. The PBS Tools were slightly modified in order to store the data in a remote database located in Bari for all the farms included in the test, as shown in Figure 8. In this way, we were able to compare the information recorded for each job by the batch system with that measured by GridICE. Concerning LSF, since we did not find a similar tool, we developed our own solution to generate the relational database; this was done only for aggregated information.

Figure 8: Set-up for job monitoring test and debugging

The data quality analysis focused on two Grid sites: INFN-Pisa, providing 70 cores, handling around 7,500 jobs per month and running the TORQUE/Maui local resource manager; and INFN-LNL-2, providing 202 cores and running the LSF local resource manager.

For the INFN-Pisa site, Table 2 summarizes the number of observed jobs in the period from 1st January 2007 to 24th February 2007. We found 1,705 jobs present in GridICE that were not present in the batch system records. 1,702 out of the 1,705 refer to jobs killed while still in the queue; they were therefore correctly recorded by GridICE and not observed by the PBS Tools. The remaining 3 out of 1,705 refer to jobs for which GridICE recorded incomplete information (that is, the job information disappeared from the information service before a valid final state was observed). Considering the 13,032 jobs that were observed by both the PBS Tools and GridICE, only 21 showed some difference in one of the related attribute values. The comparison of the accounted CPU and wall time is presented in Table 3. The quality of the GridICE data is even better than the already good result derived from the number of jobs.

Table 2: Comparison of the number of observed jobs provided by GridICE against those provided by the TORQUE/Maui batch system on the INFN-Pisa farm (Jan 1, 2007 - Feb 24, 2007)

                               GridICE
  Batch system    Present   Not present     Total   Efficiency
  Present          13,032            13    13,045        99.9%
  Not present       1,705             -     1,705            -
  Total            14,737            13    14,750            -

Table 3: Comparison of the wall and CPU time provided by GridICE against those provided by the TORQUE/Maui batch system on the INFN-Pisa farm (Jan 1, 2007 - Feb 24, 2007)

                   GridICE   Batch system   Agreement
  Wall time    211,446,697    211,319,993      99.94%
  CPU time     162,369,953    162,257,017      99.93%
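The cross-check summarized in Table 2 amounts to comparing, per local job identifier, the set of jobs recorded in the batch-system-derived database with the set recorded by GridICE. Below is a minimal sketch of that counting; how the two id sets are extracted from the respective databases is omitted, and the real schemas differ.

```python
# Sketch of the job-level cross-check behind Table 2: compare the sets of
# local LRMS job ids recorded by the batch-system database and by GridICE.
def cross_check(batch_ids, gridice_ids):
    both = batch_ids & gridice_ids
    batch_only = batch_ids - gridice_ids      # jobs missed by GridICE
    gridice_only = gridice_ids - batch_ids    # e.g. jobs killed while queued
    efficiency = len(both) / len(batch_ids)   # fraction of batch jobs seen
    return len(both), len(batch_only), len(gridice_only), efficiency

# With the INFN-Pisa numbers of Table 2 this would return
# (13032, 13, 1705, ~0.999).
```
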
Concerning the INFN-LNL-2 site, the first step was to validate the script we used to generate the relational database with aggregated job monitoring information. Table 4 shows its accuracy for the jobs observed in the period from 4th January 2007 to 3rd February 2007. The test demonstrates a precision better than one part in ten thousand.

Table 4: Comparison of the job information obtained by means of the bacct LSF command with that provided by our script on the INFN-LNL-2 farm (Jan 4, 2007 - Feb 3, 2007)

                        Our script      LSF bacct   Agreement
  Done jobs                  9,034
  Exited jobs                  659
  Total jobs                 9,693          9,693        100%
  CPU time             143,602,112    143,594,851     99.995%
  Wall time            170,728,730
  Average turnaround        35,870         35,870        100%

Table 5 compares the number of observed jobs on the INFN-LNL-2 farm during the period from 1st January 2007 to 30th January 2007. As regards the 1,203 jobs present in GridICE and not in the LSF logs, more than 70% refer to jobs with a non-zero exit code or to jobs canceled by the user. As regards the 16,133 jobs observed by both the batch system and GridICE, 226 (less than 1.5%) showed some difference in one of the related attribute values. The comparison of the wall and CPU time accounted by GridICE and by the LSF batch system is shown in Table 6. The LSF sensor appears to be less accurate than the corresponding TORQUE/Maui one.

Table 5: Comparison of the number of observed jobs provided by GridICE against those provided by LSF running on the INFN-LNL-2 farm (Jan 1, 2007 - Jan 30, 2007)

                               GridICE
  Batch system    Present   Not present     Total   Efficiency
  Present          16,133            47    16,180        99.7%
  Not present       1,203             -     1,203            -
  Total            17,336            47    17,383            -

Table 6: Comparison of the used wall and CPU time on the INFN-LNL-2 farm (Jan 1, 2007 - Jan 30, 2007)

                   GridICE   Batch system   Agreement
  Wall time    372,481,145    371,564,665      99.75%
  CPU time     340,330,602    339,659,678      99.80%

The above tests helped us to better tune the sensors; therefore, we expect to obtain even better results in the near future.

As regards tests on large sites, the INFN-T1 relies on the LSF batch system to manage 2,500 cores. We have not yet performed data quality tests at this site; nevertheless, we compared the number of Grid jobs observed by the INFN-T1 local monitoring system with those observed by GridICE. In Figure 9, we show the comparison of the two measurements over a period of a week, obtained by summing up the number of individually observed jobs. The difference between the two values at a given time is satisfactory for most of the week. In the last part of the monitored week, we observed a deficit in the number of jobs monitored by GridICE. The reason was traced thanks to the GridICE fabric monitoring and its notification capabilities: one of the INFN-T1 Grid head nodes was not working properly, causing the loss of the information for all the jobs handled by that particular machine.

Figure 9: Comparison of GridICE job monitoring and LSF local job monitoring on the INFN-T1 farm

5. CONCLUSIONS

In this paper, we have presented the recent evolution of GridICE, a monitoring tool for Grid systems. These evolutions mainly focused on improving the stability and reliability of the whole system, introducing new sensors and extending the Web presentation. A detailed description of the motivations and of the issues in evolving these features was provided, with particular attention to job monitoring and batch system information collection. A data quality analysis was performed in a production environment in order to investigate the trustworthiness of the data being measured and collected by the GridICE server. The results confirmed that GridICE meets the expected level of correctness. Future work is targeted at the proper handling of the VOMS attributes attached to a user proxy certificate. Furthermore, the integration of new sensors related to the gLite WMS and the monitoring of file access and transfer are envisioned.

6. ACKNOWLEDGMENTS

We would like to thank the funding projects BioinfoGRID [6], EGEE [9], EUChinaGrid [10], EU-IndiaGrid [11], EUMedGrid [12], LIBI [17] and OMII-Europe [20] for supporting our work. Many thanks also to the LEMON team for their fruitful collaboration and prompt support. This work makes use of results produced by the Enabling Grids for E-sciencE (EGEE) project, a project co-funded by the European Commission (under contract number INFSO-RI-031688) through the Sixth Framework Programme. EGEE brings together 91 partners in 32 countries to provide a seamless Grid infrastructure available to the European research community 24 hours a day.

7. ADDITIONAL AUTHORS

8. REFERENCES
[1] C. Aiftimiei, S. Andreozzi, G. Cuscela, N. De Bortoli, G. Donvito, E. Fattibene, G. Misurelli, A. Pierro, G. Rubini, and G. Tortone. GridICE: Requirements, Architecture and Experience of a Monitoring Tool for Grid Systems. In Proceedings of the International Conference on Computing in High Energy and Nuclear Physics (CHEP2006), Mumbai, India, Feb 2006.
[2] R. Alfieri, R. Cecchini, V. Ciaschini, L. dell'Agnello, A. Frohner, A. Gianoli, K. Lörentey, and F. Spataro. VOMS, an Authorization System for Virtual Organizations. In Proceedings of the 1st European Across Grids Conference, Santiago de Compostela, Spain, February 2003. LNCS 2970:33-40, 2004.
[3] S. Andreozzi, N. De Bortoli, S. Fantinel, A. Ghiselli, G. Rubini, G. Tortone, and M. Vistoli. GridICE: a Monitoring Service for Grid Systems. Future Generation Computer Systems, 21(4):559-571, 2005.
[4] S. Andreozzi, S. Burke, L. Field, S. Fisher, B. Kónya, M. Mambelli, J. Schopf, M. Viljoen, and A. Wilson. GLUE Schema Specification - Version 1.2, Dec 2005.
[5] Berkeley Database. http://www.oracle.com/technology/products/berkeleydb/index.html.
[6] BioinfoGRID: Bioinformatics Grid Application for life science. http://www.bioinfogrid.eu/.
[7] K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman. Grid Information Services for Distributed Resource Sharing. In Proceedings of the 10th IEEE International Symposium on High-Performance Distributed Computing (HPDC-10), San Francisco, CA, USA, Aug 2001.
[8] European DataTAG project. http://datatag.web.cern.ch/datatag/.
[9] Enabling Grids for E-sciencE, 2007. http://www.eu-egee.org.
[10] EUChinaGrid project. http://www.euchinagrid.org/.
[11] EU-IndiaGrid project. http://www.euindiagrid.eu/.
[12] EUMedGrid project. http://www.eumedgrid.org/.
[13] G. Good. The LDAP Data Interchange Format (LDIF). IETF RFC 2849, Jun 2000.
[14] GridICE Website. http://grid.infn.it/gridice.
[15] E. Laure et al. Programming the Grid with gLite. Technical Report EGEE-TR-2006-001, CERN, 2006.
[16] LEMON - LHC Era Monitoring. http://cern.ch/lemon/.
[17] LIBI - International Laboratory on Bioinformatics. http://www.libi.it/.
[18] Load Sharing Facility (LSF). http://www.platform.com/products/platform.lsf.family/Platform.LSF/.
[19] Maui Cluster Scheduler. http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php.
[20] OMII-Europe. http://omii-europe.org.
[21] OpenLDAP - The OpenLDAP Project. http://openldap.org.
[22] Portable Batch System tools - Ohio Supercomputer Center. http://svn.osc.edu/repos/pbstools/trunk/readme.
[23] TORQUE Resource Manager. http://www.clusterresources.com/pages/products/torque-resource-manager.php.
[24] J. Treadwell. The Open Grid Services Architecture (OGSA) Glossary of Terms, Version 1.5. OGF GFD.81, Jul 2006.
[25] M. Wahl, T. Howes, and S. Kille. Lightweight Directory Access Protocol (v3). IETF RFC 2251, Dec 1997.