Disaster recovery of the INFN Tier-1 data center: lesson learned. Luca dell'Agnello, INFN-CNAF. CHEP 2018 conference, Sofia, July 2018.

Disaster recovery of the INFN Tier-1 data center: lesson learned
Luca dell'Agnello, INFN-CNAF
CHEP 2018 conference, Sofia, July 12 2018

INFN
- The National Institute for Nuclear Physics (INFN) is funded by the Italian government. Its main mission is the study of elementary particles and of the physical laws of the Universe.
- It is composed of several units: ~20 units located at the main Italian university physics departments, 4 laboratories and 2 national centers dedicated to specific tasks.
- CNAF, located in Bologna, is the INFN national center dedicated to computing applications.

The INFN Tier-1
- Its first incarnation dates back to 2003, as the computing center for BaBar, CDF and Virgo, and as a prototype for the LHC experiments (ALICE, ATLAS, CMS, LHCb).
- After a complete infrastructural refurbishment in 2008, it nowadays provides services and resources to more than 30 scientific collaborations; 70-80% of the resources go to the WLCG experiments (but most of the support effort goes to the non-WLCG experiments!).
- A new data center is being planned to cope with the highly demanding computing requirements of HL-LHC and of newly arriving experiments.

The INFN Tier-1: some figures
- The data center is located 2 levels below street level.
- It is divided into 4 main halls: 2 halls for IT, 1 small hall for the GARR PoP and 1 electrical room.
- Resources (before the flood): ~1,000 WNs, ~20,000 computing slots, ~220 kHS06 (plus ~20 kHS06 in Bari-ReCaS); a small (~33 TFlops) HPC cluster; ~23.4 PB of disk storage; 1 tape library with 42 PB of data.
- Dedicated network channel (60 Gb/s) for LHC OPN + LHC ONE, with 20 Gb/s reserved for LHC ONE.
- 21 people work on the Tier-1 (including the facilities support staff).

The INFN Tier-1 location
[Floor plan: the electrical room, the GARR PoP, Hall 2 with the core switches and Hall 1 with the tape library.]

The flood

November 9: the flood
- The flood occurred early in the morning, caused by the rupture of one of the main water pipelines of Bologna, located in a road next to CNAF.
- After I was alerted, my first thought was for the external doors (all Tier-1 doors are watertight).
- Then, with a quick check, I realized the data center was no longer online.
[Photo: the entrance of the data center at 7:37 CET, our first "data lake" :-)]


November 9 flood: some findings
- Height of the water outside: 50 cm.
- Height of the water inside: 10 cm (on the floating floor), for a total volume of ~500 m³.
- The water poured into the data center through the walls and the floor.
[Photo: the broken pipe, Ø = 20 cm]

The situation outside...
[Photos: men at work on the broken pipe; the road collapsed under the weight of the fire truck at the CNAF entrance.]

... and inside the data center
[Photos: steam produced by the Joule effect in the electrical room; pictures taken in the afternoon, after part of the water had been pumped out.]

Damage assessment and first intervention
- The power center was the most damaged part: both 1.4 MW power lines were compromised (including the controls for the UPSs and diesel engines).
- The two lowest units of all racks in the IT halls were submerged, including the two lowest rows of tapes in the library; all storage systems were involved.
- The 3 core switch/routers and the general IP router were spared by a few centimeters.
- First operations: the data center was dried over the first weekend; cleaning from dust and mud started immediately after, with the support of a specialized company, and was completed during the first week of December.

Power Center configuration before the flood
[Diagram: 15 kV feed from the energy supplier's powerhouse to transformers TR1, TR2 and TR3 (2,500 kVA); two power lines with UPS units (UPS-1, UPS-2) and diesel generators feeding, through electric panels, the IT halls and the cooling system (chillers 1-3 and 4-6).]

Power Center after the flood
[Same diagram, annotated: both power lines damaged by water; all the electrical systems affected!]

Power Center recovering
[Diagram: first power line active since 19/12, without UPS or diesel engine.]

Power Center recovering
[Diagram: first power line active (small UPS + diesel engine, in operation since 11/01); in-row air conditioning in the IT halls restored in mid January.]

Power Center recovering
[Diagram: first power line completely restored (UPS in operation since 15/02; line active since 19/12).]

Present Power Center status
[Diagram: first power line completely restored; second power line nearly recovered (ETA 12/07); the continuity strategy (UPS/diesel) for the second line is still under evaluation.]

Damage to IT equipment: the list
- Computing farm: ~34 kHS06 lost (~14% of the total capacity); no special action taken (the lost capacity to be replaced).
- Library and HSM system: 1 drive and several non-critical components damaged; 4 TSM-HSM servers damaged (replaced); the library was recertified in January; 136 tapes damaged, of which 75 were sent to a lab for recovery (63 fully recovered, 6 partially recovered, 6 still to be recovered).
- Nearly all disk storage systems (and experiments) were involved: 11 DDN JBODs with RAID parity lost, 2 Huawei JBODs (non-LHC experiments), 2 Dell JBODs (including controllers), 4 disk servers.

System     | PB   | JBODs | Experiments involved
-----------|------|-------|--------------------------------------------------------------
Huawei     | 3.4  | 2     | All astroparticle and nuclear experiments except AMS, Darkside and Virgo
Dell       | 2.2  | 2     | Darkside and Virgo
DDN 1, 2   | 1.8  | 4     | ATLAS, ALICE and LHCb
DDN 8      | 2.7  | 2     | LHCb
DDN 9      | 3.8  | 2     | CMS
DDN 10, 11 | 10   | 3+2   | ATLAS, ALICE and AMS
Total      | 23.9 |       |

Storage recovery
- In parallel with the recovery of the power system, various activities were performed to recover the wet IT equipment: cleaning and drying disks, servers and switches (using an oven when appropriate). Apparently, wet disks work after they are cleaned and dried (at least for a while...).
- Replacement components were ordered only for systems still under support in 2018 (DDN8, used by LHCb, was to be phased out in Q1 2018). Moreover, some components were not available for bulk replacement on the older systems, e.g. the disks of DDN8 (LHCb) and DDN9 (CMS) are out of production.
- Other older systems (DDN1 and DDN2, used in mirror as a buffer) were repaired with spare parts we had in house.

My wife's comment on my concern for data recovery: "Why don't you store your data in the cloud?" :-)

Storage recovery: an interlocking game
- We decided to move the LHCb data to another storage system and planned to use the good disks from DDN8 to replace the wet disks of DDN9 (CMS).
- To do this we needed to install the 2017 tender storage and move the LHCb data there; but first we had to upgrade our network infrastructure (mid December) to support the disk servers of the new storage (2x100 Gbps Ethernet connections), also needed for the DCI to CINECA (remote farm extension) and for the OPN/ONE upgrade to 2x100 Gbps.
- The new 2017 storage was delivered immediately before the Christmas break and its installation was completed in January; even though it was not yet validated, we moved all the LHCb data onto it from the damaged storage system :-)
- Later on, the good disks from DDN8 were used to replace the wet disks of DDN9 (CMS) :-)

Storage recovery: not only good news
- The Dell storage (Darkside and Virgo) was easily recovered with the help of the vendor support: after replacing the damaged controllers, the wet disks were switched on and replaced one by one to let the RAID rebuild.
- Unfortunately, we could recover only 1/3 of the data on the Huawei system (astroparticle experiments): ~2.2 PB of data were lost (mostly retransferred or regenerated since).
- We suspect a wrong strategy on the part of the vendor support: the damaged disks stayed powered on for days before the support decided what to do...

Murphy's law: unique data was stored on this system only...

Farm recovery
- During February we started reopening the services (LSF masters, CEs, squids, etc.), not for all experiments at the same time (depending on storage availability).
- We performed an upgrade of the WNs: middleware and security patches (e.g. for Meltdown).
- Only part of the local farm was powered on (only 3 chillers in production): ~150 kHS06 out of the ~200 kHS06 available.
- But we exploited the CNAF farm extensions to provide more computing power: a remote farm partition in Bari-ReCaS (~22 kHS06) and a remote extension farm (~180 kHS06) at CINECA, in production since March.
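To put these numbers in context, here is a quick tally of the capacity actually available during the recovery. This is only an illustrative sketch: the figures are the approximate ones quoted on this slide.

```python
# Rough tally, in kHS06, of the computing capacity available during the
# recovery, using only the approximate figures quoted on this slide.

capacity_khs06 = {
    "CNAF local farm (3 chillers only)": 150,  # out of ~200 kHS06 installed locally
    "Bari-ReCaS remote partition": 22,
    "CINECA remote extension": 180,
}

total = sum(capacity_khs06.values())
for site, khs06 in capacity_khs06.items():
    print(f"{site:<36} {khs06:>4} kHS06")
print(f"{'Total available':<36} {total:>4} kHS06")
```

Even with only ~150 kHS06 powered on locally, the remote partitions brought the total available capacity above the pre-flood local figure.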

Farm remote extensions
- ~180 kHS06 provided by CINECA for 2018-2021. CINECA, also located in Bologna (~15 km from CNAF), is the Italian supercomputing center and a Tier-0 for PRACE.
- 216 WNs (10 Gbit connection to the rack switch, then 4x40 Gbit to the aggregation router), managed by LSF@T1.
- Dedicated fiber directly connecting the INFN Tier-1 core switches to our aggregation router at CINECA: 500 Gbps (upgradable to 1.2 Tbps) on a single fiber pair via Infinera DCI.
- No disk cache, direct access to the CNAF storage: a quasi-LAN situation (RTT: 0.48 ms vs. 0.28 ms on the LAN).
- In production since March, slowly opening to non-WLCG experiments (CentOS 7).
[Diagram: the CINECA data center (216 WNs from the Marconi A1 partition, core switch, CINECA router) connected via 500 Gbps dark fiber to the CNAF router and the INFN Tier-1 core switches, alongside the general IP network and the LHC dedicated network.]
- Efficiency comparable to the partition at CNAF.
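The "quasi-LAN" claim can be made concrete with a back-of-the-envelope estimate. The sketch below uses the two RTT values quoted on the slide; the number of synchronous round trips per job is an invented illustrative parameter, not a measured one.

```python
# Back-of-the-envelope estimate of the latency penalty of running a job on the
# CINECA remote extension instead of the local CNAF farm.
# RTT values are those quoted on the slide; the round-trip counts are made up
# purely for illustration.

RTT_LAN_MS = 0.28     # round-trip time inside the CNAF LAN
RTT_CINECA_MS = 0.48  # round-trip time CNAF <-> CINECA over the DCI link

def extra_latency_seconds(sync_round_trips: int) -> float:
    """Extra wall-clock time spent waiting on the network at CINECA vs. CNAF."""
    return sync_round_trips * (RTT_CINECA_MS - RTT_LAN_MS) / 1000.0

if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):  # hypothetical round trips per job
        print(f"{n:>9} round trips -> {extra_latency_seconds(n):7.1f} s extra per job")
```

Even at a million synchronous round trips per job, the extra ~0.2 ms per round trip adds only a few minutes, which is consistent with the observed efficiency being comparable to the partition at CNAF.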

Lesson learned (1)
- Consider also low-probability events. In the project for our data center we had foreseen the most probable incidents (e.g. fires, power cuts, ...), not all possible ones.
- The only threat from water was supposed to be intense rain, not a large pipe breaking with a strong water flow. Waterproof doors had been installed some years ago (after a heavy rain).
- The municipal water supply company, apparently aware of the problem from the moment it happened, needed several hours to stop the flow.

Lesson learned (2)
- Wet disks and tapes are not necessarily lost: clean and dry them carefully. Disks should be powered on only for the time needed to copy the data.
- The cost of recovering a wet tape in a lab is ~500 euro, to be compared with the estimated cost of reproducing its content (> 2,000 euro, not to mention the human effort).
- No experiment should base its computing on a single site and, even worse, store its data in a single place only. Most probably this should be a feature implemented in the computing infrastructure. Some small collaborations even kept their official code only in their user area.
- In fact, the WLCG experiments could easily compensate using other sites.
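As an illustration of the multi-site lesson, the kind of check a computing infrastructure could run periodically is sketched below. The catalogue structure and the dataset names are hypothetical; a real experiment would query its own data-management system instead.

```python
# Minimal sketch of a "no single-site copy" check over a file catalogue.
# The catalogue below is invented for the example; a real experiment would
# query its own data-management system instead.

catalogue = {
    "astro/raw/2017":   ["CNAF"],                 # single copy at a single site: at risk
    "lhcb/dst/2017":    ["CNAF", "CERN", "PIC"],  # replicated off-site: safe
    "virgo/calib/2016": ["CNAF", "CNAF"],         # two copies, same site: still at risk
}

def at_risk(cat: dict[str, list[str]]) -> list[str]:
    """Return the datasets whose replicas all live at a single site."""
    return [name for name, sites in cat.items() if len(set(sites)) < 2]

if __name__ == "__main__":
    for name in at_risk(catalogue):
        print(f"WARNING: {name} has no off-site replica")
```

A check like this, run as part of routine data management, would have flagged the astroparticle datasets that existed only at CNAF before the flood.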

A possible future for the INFN Tier-1: towards the HL-LHC Data Lake

Looking for a new location for the Tier-1
- The plan: take into account the needs of HL-LHC (i.e. the data lake) and the expansions due to astroparticle experiments. This plan has become more urgent after the flood.
- An opportunity is given by the new ECMWF center, which will be hosted in Bologna from 2019 in the new Technopole area. There is the possibility to host in the same area also the INFN Tier-1 and the CINECA computing center.
- Funding has been promised by the Italian Government to INFN and CINECA to set up the data center (2021).
- Up to 2x10 MW for IT (2026), PUE ~ 1.2.
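For reference, the quoted PUE target pins down the total facility power implied by the planned IT load. A minimal sketch of the arithmetic, using only the standard PUE definition and the targets on this slide:

```python
# PUE (Power Usage Effectiveness) = total facility power / IT equipment power.
# Numbers are the targets quoted for the new Technopole data center.

PUE = 1.2
IT_POWER_MW = 2 * 10  # two power feeds of 10 MW of IT load each (2026 target)

total_facility_mw = PUE * IT_POWER_MW          # power drawn by the whole site
overhead_mw = total_facility_mw - IT_POWER_MW  # cooling, distribution losses, etc.

print(f"IT load:        {IT_POWER_MW} MW")
print(f"Total facility: {total_facility_mw:.0f} MW")
print(f"Overhead:       {overhead_mw:.0f} MW ({overhead_mw / IT_POWER_MW:.0%} of IT load)")
```

With a PUE of 1.2, a 20 MW IT load implies roughly 24 MW drawn by the facility, i.e. about 4 MW of cooling and distribution overhead.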

Conclusions
- The INFN Tier-1 has been fully operational since March, with some hiccups at the restart; some systems are not completely recovered yet.
- We are still working on the 2nd power line (needed for redundancy); the strategy for continuity on the 2nd line has not been decided yet.
- We are currently redesigning our alarm system to be really reactive in case of flood; in the short term, activity is also ongoing to improve the isolation of the data center perimeter.
- In the medium term (2021) we plan to move our data center to another location, also to take into account the requirements of the HL-LHC era.

Practicing for next time :-)