EGI-InSPIRE RI NGI_IBERGRID ROD. G. Borges et al. Ibergrid Operations Centre LIP IFCA CESGA

Similar documents
Regional SEE-GRID-SCI Training for Site Administrators Institute of Physics Belgrade March 5-6, 2009

E u r o p e a n G r i d I n i t i a t i v e

Outline. Infrastructure and operations architecture. Operations. Services Monitoring and management tools

On the EGI Operational Level Agreement Framework

Provisioning of Grid Middleware for EGI in the framework of EGI InSPIRE

COD DECH giving feedback on their initial shifts

EGI-InSPIRE COD F2F. Ron Trompert. 11/14/12 1 EGI-InSPIRE RI

Geographical failover for the EGEE-WLCG Grid collaboration tools. CHEP 2007 Victoria, Canada, 2-7 September. Enabling Grids for E-sciencE

E GI - I ns P I R E EU DELIVERABLE: D4.1.

EGI Operations and Best Practices

Technical Brief SUPPORTPOINT TECHNICAL BRIEF MARCH

EGI-InSPIRE. EGI Applications Database (TNA3.4) William Vassilis Karageorgos, et al.

Security Service Challenge & Security Monitoring

USER QUICK LOOK FOR FACULTY & STAFF

CSA Helpdesk User Guide

Design, Development and Improvement of Nagios System Monitoring for Large Clusters

How PERTs Can Help With Network Performance Issues

ICON Laboratory Services, Inc. isite User Guide

Step by Step ASR, Azure Site Recovery

EMI Deployment Planning. C. Aiftimiei D. Dongiovanni INFN

Getting Started with Soonr

Magento Enterprise Edition Customer Support Guide

Magento Integration Manual (Version /15/2017)

Minimum Requirements for Cencon 4 with Microsoft R SQL 2008 R2 Enterprise

Regional e-infrastructures

1. Create a customer account.

Objective of the Postgraduate Online Approval Procedure: Description of the Postgraduate Online Approval Procedure:

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez

Security. ITM Platform

An End User s Perspective of Central Administration

Monitoring System for the GRID Monte Carlo Mass Production in the H1 Experiment at DESY

CESGA: General description of the institution. FINIS TERRAE: New HPC Supercomputer for RECETGA: Galicia s Science & Technology Network

Perform Configuration Audits Using Compliance

Integrating with your Third Party Monitoring Software

Lookout Mobile Endpoint Security. Deploying Lookout with BlackBerry Unified Endpoint Management

European Grid Infrastructure

op5 Monitor 4.1 Manual

Is this a known issue? Seems to affect only recurring events. I have some of them and all are shifted. Non-recurring events show properly.

Deploy Webex Video Mesh

Opportunities A Realistic Study of Costs Associated

Please review the SECURE GATEWAY Administrator Guides prior to engaging the AT&T Managed Security Services Support Center.

A European Cloud federation

Development Application Online. HowTo guide for applicants

Quick Reference Guide: Working with CommVault Customer Support

eprocurement MET Pilot Portal - SUPPLIER GUIDE Pag. 1/56 TABLE OF CONTENTS

Account Customer Portal Manual

LRAs Attestations of Their End Users Access to ClinicalConnect

Quick Start: Permission requests and approvals

Notification Template Limitations. Bridge Limitations

Frequently Asked Questions for Sales & Marketing Users

MAINTENANCE HELPDESK SYSTEM USER MANUAL: CUSTOMER (STAFF) VERSION 2.0

Understanding Remedyforce Sandboxes

Version 2.4 Final. TMDB System User Manual (Registry)

Online Expenses User Guide System Provided by Software Europe

Enquiries about results guide (UK) 2018 A guide for exams officers

SMS ALARM APPLICATION

StruxureWare TM Data Center Operation Periodic Maintenance. Statement Of Work. 2.0 Features & Benefits

User Guide. Remote Support Tool

QUICK START GUIDE. Your Power Solutions Partner

Welcome to Hitachi Vantara Customer Support for Pentaho

2018 CLUB ENTRY SYSTEM ONLINE CONSOLE - STEP BY STEP GUIDE

StarWind Virtual SAN Installing and Configuring SQL Server 2012 Failover Cluster Instance on Windows Server 2012

Version 2.3 Final. TMDB System User Manual (Registrar)

Hosting with Eduphoria

Zemana Endpoint Security Administration Guide. Version

Nimsoft Cloud User Experience

Acronis Data Cloud plugin for ConnectWise Automate

UNOPS esourcing vendor guide. A guide for vendors to register on UNGM, and submit responses to UNOPS tenders in the UNOPS esourcing system

SQL Server Database Provisioning Report. Survey on database provisioning requirements among SQL Server Professionals

TechDirect Demo Presentation Guide

GRANTS AND CONTRIBUTIONS ONLINE SERVICES USER GUIDE: CANADA SUMMER JOBS

Team Management. Coaches and Team Managers Guide

User guide NotifySCM Installer

A s c e r t i a S u p p o r t S e r v i c e s G u i d e

cdiscount version BoostMyShop

Centralized Log Hosting Manual for User

Create Automatic Deployment Rule In SCCM 2012 R2

REPORT 2015/149 INTERNAL AUDIT DIVISION

Microsoft Dynamics AX. Lifecycle Services: Operate Phase. Last Updated: June 2014 AX 2012 R3 / Version 1.0.0

vcenter CapacityIQ Installation Guide

Oracle Identity Manager

Manual Backup Sql Server Express 2008 Scheduling Automated

GRANTS AND CONTRIBUTIONS ONLINE SERVICES USER GUIDE: CANADA SUMMER JOBS

Please Note To use this extension, you must have UVdesk account. You can create a free UVdesk account here.

DataLink Learn (SaaS or 9.1 Oct 2014+) Integration

CS 577A Team 1 DCR ARB. PicShare

INFRASTRUCTURE AND USER COMMUNITIES

UNOPS esourcing vendor guide. A guide for vendors to register on UNGM, and submit responses to UNOPS tenders in the UNOPS esourcing system

USER MANUAL LANGUAGE TRANSLATOR TABLE OF CONTENTS. Version: 1.1.6

Administrator Manual. Last Updated: 15 March 2012 Manual Version:

GfK Digital Trends for Android. GfK Digital Trends Version 1.21

Xrio UBM Quick Start Guide

HARTREE CENTRE SERVICE NOW SELF- SERVICE PORTAL

SureClose Advantage. Release Notes Version

TELUS Cloud Contact Centre (TC3) Customer Care Guide

Acronis Data Cloud plugin for ConnectWise Automate

A Guide to Finding the Best WordPress Backup Plugin: 10 Must-Have Features

Influence of Distributing a Tier-2 Data Storage on Physics Analysis

PROSPECT USER MANUAL

Communications Room Policy

Transcription:

EGI-InSPIRE RI-261323 NGI_IBERGRID ROD G. Borges et al. Ibergrid Operations Centre LIP IFCA CESGA

: Introduction IBERGRID: Political agreement between the Portuguese and Spanish governments. It foresees 5 lines of action: Grid Computing; HPC Network; Applications; Volunteer Computing Provides an umbrella for an Iberian regional grid Covers the Portuguese and Spanish NGIs Integrates Portuguese and Spanish resources Fully interoperable with EGI Share the load of NGI international tasks G. Borges et al. - EGI User Forum 2011 2

28 certified sites 7 Portuguese sites; 21 Spanish sites > 26000000 SI00* > 10700000 GB (Online Storage)* Critical services geographically distributed Setup dynamic DNS alias schemes to serve TopBDIIs and WMSs Regional VOMS and LFC operated in Portugal with redundant copies in Spain Regional Dashboard operated and installed in Portugal Regional SAM (Nagios) operated and installed in Spain * http://gstat.egi.eu/gstat/summary/egi_ngi/ngi_ibergrid/ : Infrastructure Production Core Services Backup Core Services Regional SAM Operations Coordination Production Core Services Backup Core Services Operations Coordination G. Borges et al. - EGI User Forum 2011 3 l l Production Core Services Backup Core Services Regional Dashboard Operations Coordination l

: ROD Operations NGI_IBERGRID ROD Operations Model Weekly shifts performed by LIP, CESGA and IFCA G. Borges et al. - EGI User Forum 2011 4

: Notepad Dear Site administrators, "(Host FQDN)" is failing since X hours ago. For details, please check: https://rnagios.ibergrid.cesga.es/nagios/cgi-bin/status.cgi?host="(host FQDN)" Can you please take a look before a ticket is open to your site? Don't hesitate to ask for clarifications in either [1] or [2] in case of doubts. Thanks in advance, IBERGRID ROD. [1] ibergrid-rod@listas.cesga.es [2] ibergrid-rollout@listas.cesga.es G. Borges et al. - EGI User Forum 2011 5

: 1 st line user support ROD teams are providing simultaneously the 1 st line user support to site administrators Via mailing lists and GGUS tickets G. Borges et al. - EGI User Forum 2011 6

: Operation Report Dear Ibergrid site admins, Here is the operational status report regarding alarms in the ROD Dashboard, open tickets in GGUS (snapshot taken at Friday 1st April 2011, 11:20:00 UTC). We kindly ask for the participation of the mentioned sites in the next weekly meeting for some update on the pending issues. Best Regards, Goncalo Borges, on behalf of NGI_IBERGRID ROD team. ### Dashboard Alarms ### Sites with alarms (< 24h) in the Dashboard. An warning email has been sent to these sites, and if the alarms continue active for more than 24h, a new ticket will be raised at those sites. - IEETA, UOGRID ### Open GGUS tickets ### There are 8 open tickets in the local helpdesk. Please find below a short summary of those tickets. Please take the appropriate actions: - Change the ticket status from "ASSIGNED" to "IN PROGRESS". - Provide feedback on the issue as regularly as possible. - In case of problems, ask for help in ibergrid-ops@listas.cesga.es - For long pending issues, put your site/node in downtime. - Don't forget to close the ticket when you have solved the problem. =========================================================================== SITE : * BIFI * GGUS ID : 68459 Open since : March 10 2011 09:32 UTC Status : in progress Description : Uploading problem at ce-egee.bifi.unizar.es BIFI Link : https://gus.fzk.de/ws/ticket_info.php?ticket=68459 ======================================================================== G. Borges et al. - EGI User Forum 2011 7

: Response Level NGI_IBERGRID Site Responsiveness model G. Borges et al. - EGI User Forum 2011 8

: Escalations G. Borges et al. - EGI User Forum 2011 9

: Regional Dashboard Installed and operated by LIP G. Borges et al. - EGI User Forum 2011 10

: Regional Dashboard The tool is based on 3 components The Web Interface The Web Service Lavoiser The DataBase All services are deployed in the same machine (but they could be deployed separately) Tool deployment Deployment of the Web Sevice Setup of the Database Configuration of http conf files and php files Why did we choose to deploy it To gain experience with the tool and with the technology To be able to adjust the tool to our own needs, if necessary To be become independent G. Borges et al. - EGI User Forum 2011 11

:Dashboard issues Some steps behind w.r.t. central dashboard In the current version is still not mandatory to provide an explanation for the closure of a non-ok alarm Testing of the released regional packages Not very good! We seem to be always the first ones really trying to deploy the service A lot of interactions with the support staff is needed Synchronization problems between regional and central dashboard Complains from COD that ROD is not doing well. When we start to investigate, must of the times, the problem is due to a sync issue between regional dashboard and central dashboard. It is quite annoying receiving tickets which go to our management saying that we are not performing well, and at the end, this is not true. https://gus.fzk.de/ws/ticket_info.php?ticket=68414 G. Borges et al. - EGI User Forum 2011 12

:Operation issues Issue 1: r-nagios alarms notifications Frequently, the dashboard does not reflect a r-nagios change from critical to ok, and the alarms keeps aging forever. Dashboard developers claim that the problem is on r-nagios side which does not sends the proper message ROD team is forced to close the alarm in a non-ok state, ruining its workload. Regional dashboard still does not has the functionality for providing a justification for the non-ok alarm closure. This is a job for COD: Coordinate the interactions from Dashboard and r-nagios developers to solve the problem https://gus.fzk.de/ws/ticket_info.php?ticket=68414 G. Borges et al. - EGI User Forum 2011 13

: Operation issues Issue 2: Testing new services A service can be tagged in GOCDB as Monitored and Not in Production Sites use these flags to test a new service Those settings still raise alarms in the ROD Dashboard, which have to be handled accordingly Sites complain (and they are right) that the service is declared as Not in Production and therefore, no tickets should be opened. Workaround: We ask the sites to declare a new downtime even for all services they wish to test. G. Borges et al. - EGI User Forum 2011 14

: Operation issue Issue 3: Downtime for alarms < 24h Due to the adopted model, we notify sites for failures occurring for < 24h When sites recognize that the problems may take time to solve, start a downtime for an alarm < 24h The alarm keeps aging. When it reaches 72h, it doesn't make sense to open a ticket because the site is in downtime ROD is forced to close the alarm in a non-ok status, ruining once more the ROD workload. G. Borges et al. - EGI User Forum 2011 15