Geographical failover for the EGEE-WLCG Grid collaboration tools. CHEP 2007 Victoria, Canada, 2-7 September. Enabling Grids for E-sciencE


Geographical failover for the EGEE-WLCG Grid collaboration tools
CHEP 2007, Victoria, Canada, 2-7 September
Alessandro Cavalli, Alfredo Pagano (INFN/CNAF, Bologna, Italy)
Cyril L'Orphelin, Gilles Mathieu, Osman Aidel (IN2P3/CNRS Computing Centre, Lyon, France)
Rafal Lichwala (PSNC, Poznan, Poland)

2/12 Outline
- The Failover System
- Technical solutions
  - the DNS and the new domain
  - Oracle replication
- Use cases
  - CIC Portal
  - SAM Admin
  - GSTAT, GRIDICE
  - GOCDB
  - SAM
- Future plans and improvements
  - Distributed agents
  - Monitoring system
  - CIC Portal?

3/12 Failover: definition and principle
A backup operation that automatically switches to a standby database, server or network if the primary system fails or is temporarily shut down for servicing. Failover is an important fault tolerance function of mission-critical systems that rely on constant accessibility. Automatically and transparently to the user, failover redirects requests from the failed or down system to the backup system, which mimics the operations of the primary system.

4/12 Failover: definition and principle How much availability must we guarantee?
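
The question becomes concrete once an availability target is translated into the downtime it allows. The following is a minimal sketch, assuming a 365-day year; the function name and the sample targets are illustrative and are not taken from the talk.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down while still meeting the target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Illustrative targets only; the talk does not state a required figure.
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability allows {hours:.2f} hours of downtime per year")
```

Each additional "nine" divides the allowed downtime by ten, which is why the cost of the failover setup has to be weighed against the target.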

5/12 Downtime causes
Magic words are:
- Redundancy
- Remove Single Points of Failure (SPOF)

6/12 EGEE Failover: purpose
Propose, implement and document failover procedures for the collaboration, management and monitoring tools used in the EGEE/WLCG Grid. These tools (listed later in this talk) are used daily and heavily by COD teams, regional and site operators and other user categories, for grid management and control purposes. This is why the availability requirement is high, and it tends to become higher in the future.

7/12 EGEE Failover: background and history
Born as an EGEE SA1 Operations COD task.
Reminder: who are the CODs?
- Teams provided by EGEE federations, working in pairs (one lead and one backup) on a weekly rotation
- Role:
  - Watch the problems detected by the grid monitoring tools
  - Problem diagnosis
  - Report these problems (GGUS tickets)
  - Follow and escalate them if needed (well defined procedure)
  - Provide help, propose solutions
  - Build and maintain a central knowledge database (WIKI)

8/12 Links to other projects
LCG-3D
- Some differences:
  - Our failover activity deals with operational tools
  - LCG-3D deals with databases and data transfers
- Some similarities and shared goals:
  - Work on database replication and switches
  - Same concern in disaster recovery solutions
- Often involves the same (or at least related) teams: discussions and working sessions engaged
Other failover related examples
- TODO

9/12 DNS choice:
- Well supported by local staff at our institutes
- Easy to understand how to exploit its features
- Very stable and consolidated (born in the 80s)
DNS based failover
- Widely used as an element of failover solutions by ISPs and IT companies
- The DNS approach consists in:
  - mapping the service name to one or more destinations
  - updating this mapping whenever some failure is detected
- This must be coupled with procedures that:
  - keep data in sync where it is needed
  - kill unnecessary processes on the system in failure
  - enable needed processes on the replacing system
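
The mapping-and-update idea can be sketched in a few lines. This is only an illustration under stated assumptions: `ServiceRecord`, `update_mapping` and the host names are hypothetical, the health check is supplied by the caller, and the actual DNS update mechanism used for the tools is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ServiceRecord:
    """A service name and its candidate destinations, main instance first."""
    name: str
    destinations: List[str]
    active: str = ""

    def __post_init__(self) -> None:
        if not self.destinations:
            raise ValueError("at least one destination is required")
        if not self.active:
            self.active = self.destinations[0]


def update_mapping(record: ServiceRecord, is_healthy: Callable[[str], bool]) -> bool:
    """Point the service name at the first healthy destination.

    Returns True if the mapping changed. If no destination is healthy,
    the current mapping is left untouched.
    """
    for destination in record.destinations:
        if is_healthy(destination):
            changed = destination != record.active
            record.active = destination
            return changed
    return False


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    record = ServiceRecord("portal.example.org",
                           ["main.example.org", "replica.example.org"])
    main_is_down = lambda host: host != "main.example.org"
    print(update_mapping(record, main_is_down), record.active)
```

Because destinations are tried in order of preference, this sketch moves back to the main instance as soon as it is healthy again. A real procedure would first run the data synchronisation and process start/stop steps listed above.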

10/12 DNS downsides
- ISP caching policies
- Caching at OS level
- Caching by the web browsers
TO BE COMPLETED
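
The effect of these caches on a DNS switch can be estimated with a deliberately pessimistic model. The sketch below assumes that each caching layer (ISP resolver, operating system, browser) may hold an answer for the record TTL or for its own minimum hold time, whichever is longer, and that the layers refresh one after another. The function name and all numbers are hypothetical.

```python
from typing import Sequence


def stale_answer_bound(record_ttl_s: int, layer_floors_s: Sequence[int]) -> int:
    """Pessimistic upper bound (seconds) on how long a client may still be
    sent to the old address after the DNS entry has been changed.

    record_ttl_s   -- TTL published with the DNS record
    layer_floors_s -- one entry per caching layer: the minimum time that
                      layer keeps an answer regardless of the TTL
                      (0 if the layer honours the TTL)
    """
    if record_ttl_s < 0 or any(floor < 0 for floor in layer_floors_s):
        raise ValueError("times must be non-negative")
    return sum(max(record_ttl_s, floor) for floor in layer_floors_s)


if __name__ == "__main__":
    # Hypothetical values: 60 s TTL; ISP resolver enforcing a 300 s floor,
    # OS cache honouring the TTL, browser pinning answers for 900 s.
    print(stale_answer_bound(60, [300, 0, 900]), "seconds")
```

Under this model a short TTL helps only as far as every layer honours it, which is the downside the slide points at.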

11/12 Technical details & new gridops.org domain TODO

12/12 Available operations tools
- CIC Portal
- GSTAT
- GRIDICE
- GOCDB
- SAM
- SAMAP
Many tools, many links on www.gridops.org
TO BE REVISED

13/12 Geographical failover
[Diagram: user asking for CIC; DNS; CIC portal (main); CIC portal (replica)]

14/12 Geographical failover: DNS switch
[Diagram: user asking for CIC; DNS; CIC portal (main); CIC portal (replica), activated if needed]

15/12 Geographical failover: automatic switch
[Diagram: DNS; CIC portal (main); CIC portal (replica), activated if needed]
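
An automatic switch needs a rule for deciding when the main instance counts as failed. One common choice, shown here purely as an assumption and not as the mechanism used for the CIC portal, is to switch only after several consecutive failed probes so that a single lost check does not trigger a failover. The class name and threshold are hypothetical.

```python
class FailoverMonitor:
    """Decides which instance the DNS entry should point to."""

    def __init__(self, failures_before_switch: int = 3) -> None:
        if failures_before_switch < 1:
            raise ValueError("failures_before_switch must be at least 1")
        self.failures_before_switch = failures_before_switch
        self.consecutive_failures = 0
        self.target = "main"

    def observe(self, main_is_up: bool) -> str:
        """Record one probe result and return the instance to use."""
        if main_is_up:
            # Simplification: return to the main instance on the first success.
            self.consecutive_failures = 0
            self.target = "main"
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_before_switch:
                self.target = "replica"
        return self.target


if __name__ == "__main__":
    monitor = FailoverMonitor(failures_before_switch=2)
    for probe in (True, False, False, True):
        print(probe, "->", monitor.observe(probe))
```

Switching back on the first successful probe keeps the sketch short. In practice the return to the main instance would wait until data written on the replica has been synchronised.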

16/12 Geographical failover: active-active
[Diagram: user asking for GSTAT; user asking for GSTAT2; DNS; GSTAT (main); GSTAT (replica)]

17/12 Use case: CIC portal failover
- Replication added early on the list
- Highly critical tool
- Planned or unexpected service outages could break the continuity of daily activity
- First switch successfully done in December 2006
  - Replica instance used in production during one whole week
  - Normal use of the portal during this time
  - No problem reported

18/12 CIC portal failover (cont.)
The CIC portal is based on three distinct components:
- A web portal module (PHP, HTML and CSS files)
- A database module (Oracle)
- A data processing system (named Lavoisier)
Each component can work with any of the others, master or replica: 8 possible scenarios.
[Diagram: two sites, IN2P3 and CNAF, each with: User, http://cic.gridops.org, Web portal module, Lavoisier module, External data sources, Internal data, Database module]
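
The count of eight follows from three components that can each run on the master or on the replica (2 x 2 x 2). A short sketch enumerating the combinations is given below; the labels are chosen for illustration only.

```python
from itertools import product

COMPONENTS = ("web portal", "database", "Lavoisier")
INSTANCES = ("master", "replica")


def scenarios():
    """Return every assignment of a component to its master or replica instance."""
    return [dict(zip(COMPONENTS, combination))
            for combination in product(INSTANCES, repeat=len(COMPONENTS))]


if __name__ == "__main__":
    all_scenarios = scenarios()
    print(len(all_scenarios), "scenarios")
    for scenario in all_scenarios:
        print(scenario)
```

Listing the scenarios this way makes it easy to check that each combination has been exercised in a failover test.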

19/12 CIC portal failover (cont.)
Web module replication
- Portable code
- Environment & configuration on replica (Apache, PHP, libs)
- Host certificate for the replica
Data processing system (Lavoisier) replication
- Environment and configuration on replica (Apache Ant, Globus toolkit, Java web services)
- Deployment of a second instance of Lavoisier
- Settings on replica (e-mail address for local error reporting)
Database replication
- Dump of master DB exported to replica
- Well established procedure, involving 2 persons and an interruption of service
- We are working on better solutions

20/12 Use case: SAMAP functionality
SAMAP (SAM Admin's Page):
- Web-based monitoring tool for submitting SAM Framework test jobs on demand
- Based on SAM-client (Site Availability Monitoring Framework in EGEE), with additional functionality
- Implemented in response to the site administrators' needs
Provided functionality:
- SAM job submission on demand
- Check status of the running test jobs
- Cancel submitted SAM jobs
- Publish SAM job results
- Show logging info of the running test jobs
- Schedule regular SAM job submission (cron task management)

21/12 SAMAP failover
- SAMAP architecture divided into two independent parts: Portal part and UI (grid User Interface) part
  - Portal part integrated with the CIC Portal
  - UI part installed on a dedicated UI machine
- SAMAP installed in two independent instances and linked to proper DNS domain entries
- Synchronization of instances via CVS repository
- Separate WMS servers available for both instances
- Easy switch from main to backup instance by DNS entry modification
- Full transparency for end users

22/12 TODO Use cases: GSTAT,GRIDICE,GOCDB,SAM

23/12 TODO Future plans