Lessons Learned in the NorduGrid Federation

Lessons Learned in the NorduGrid Federation
David Cameron, University of Oslo
With input from Gerd Behrmann, Oxana Smirnova and Mattias Wadenstein
Creating Federated Data Stores For The LHC, 14.9.12, Lyon, France
EMI is partially funded by the European Commission under Grant Agreement RI-261611

History Lesson
- 2001: NorduGrid collaboration formed by Scandinavian universities
- Grid computing for LHC physicists
- Resources provided by the institutes
- Grid middleware: Globus (GridFTP SE, MDS info-system, RLS catalog)
- Advanced Resource Connector (ARC): CE interface, batch system interaction, data staging, client tools, VOs, accounting etc.

History Lesson: NorduGrid architecture, 2002

Moving Forward
- 2006: Nordic DataGrid Facility (NDGF) created as the Nordic Tier-1 centre for WLCG
- ARC CEs + distributed dCache SE + ARC CE caches
- Distributed resources presented as a single entity
- Resources still owned by the institutes; NDGF provides the connecting glue

NDGF-T1: A Distributed Centre?
- No single country is large enough to host a T1 by itself
- Nordic culture of cooperation
- Blurry Tier concept
- No single site dominates
- Good enough network (+ local caching) that computing is not tied to storage
- Distributed computing: easy. Distributed storage: not so easy.

dCache
- Transparent access to data on mass storage systems under a single namespace
- Interaction with HSMs
- Doors provide access via various protocols, e.g. SRM, GridFTP

Distributed dCache (OPN)

Distributed dCache
- Front-end nodes near Copenhagen (next to the OPN switch): srm.ndgf.org, ftp1.ndgf.org, namespace DB, ...
- Pool nodes scattered from Ljubljana to Umeå, connected by 10 Gb/s links
- Designed for maximum availability for the acceptance of T0 data
- The front-end is a central point of failure (as at any other site)
- But the failure of one pool/site only makes some of the data unavailable
- Internal replication of recent data minimises the effect

NDGF-T1 resources, 2006

NDGF-related dCache improvements
- Critical factor: an NDGF developer (Gerd) became a dCache developer
- GridFTPv2: control channel via the head node, data channel directly to/from the pools
- New namespace implementation
- New SRM service container
- WebDAV support
- xrootd support
- Protocols currently supported (dCache doors): SRM, GridFTP, HTTP, WebDAV, DCAP, GSIDCAP, xrootd, NFS 4.1 (illustrated below)
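To make the notion of dCache "doors" concrete, here is a small illustrative sketch of how one and the same namespace entry could be addressed through different protocol doors. The host names, port numbers and file path are hypothetical (the ports shown are common dCache defaults, but every site configures its own); these are not the actual NDGF endpoints.

```python
# Illustrative only: the same dCache namespace entry reached through different doors.
# Host names, ports and the path are hypothetical; real values are site-specific.
PATH = "/atlas/disk/data12/somefile.root"

DOOR_URLS = {
    "SRM":     f"srm://srm.example.org:8443/srm/managerv2?SFN={PATH}",
    "GridFTP": f"gsiftp://ftp1.example.org:2811{PATH}",
    "WebDAV":  f"https://dav.example.org:2880{PATH}",
    "xrootd":  f"root://xrootd.example.org:1094/{PATH}",
    "dcap":    f"dcap://dcap.example.org:22125{PATH}",
    "NFS 4.1": f"/pnfs/example.org{PATH}",   # mounted locally via the NFS door
}

if __name__ == "__main__":
    for door, url in DOOR_URLS.items():
        print(f"{door:8s} {url}")
```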

Other Tiers
- Several Tier-2/3 sites are associated with the NDGF-T1
- Some are independent, with their own SRM endpoint (Swegrid, Ljubljana, Bern, ...)
- Some are simply separate pools: same endpoint but separate SRM space tokens (Norway T2, Copenhagen, Geneva, ...)

Operations
- Distributed storage, distributed people
- An Operator on Duty role rotates weekly among the 4 countries
- Sysadmins sit at their own institutes
- They deal with GGUS tickets, downtimes, operations meetings etc.
- A chatroom is used for communication; weekly chat meeting (more efficient than voice!)
- Wiki, JIRA task tracking etc.

Current Status
- NDGF-T1 stores ~3 PB and 2 M files (ATLAS + ALICE)
- ATLAS SRM space tokens
- Plus 500 TB on tape not in a space token

Second-level storage
- The NDGF-T1 dCache provides persistent, reliable mass storage for managed data transfers
- On-demand replication and unmanaged storage are provided by the ARC caches

ARC Data Management Architecture

ARC CE Cache
- A local file system (NFS, GPFS, Lustre etc.) mounted on the CE front-end
- Cached files are soft-linked or copied into the job's working directory (see the sketch below)
- Authorisation is checked against the original source (and the result is cached)
- Files are always cached unless caching is disabled in the job description
- Space is managed automatically by evicting the least recently accessed files; no administration required
- Not accessible from outside
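As an illustration of the cache behaviour described above, the following Python sketch mimics the steps for a cached input file. It is not the actual ARC Grid Manager code: the cache layout, the helper names (cache_path, fetch_into_cache, prepare_input) and the permission stub are assumptions, and the real implementation additionally handles grid protocols, locking and cache cleaning.

```python
"""Minimal sketch of ARC-CE-style caching: not the real ARC code.
Cache layout, helper names and the download/permission stubs are hypothetical."""
import hashlib
import os
import shutil
import urllib.request

CACHE_DIR = "/var/cache/arc"          # hypothetical cache mount (NFS/GPFS/Lustre)


def cache_path(source_url: str) -> str:
    """Map a source URL to a stable file name inside the cache."""
    digest = hashlib.sha1(source_url.encode()).hexdigest()
    return os.path.join(CACHE_DIR, "data", digest[:2], digest)


def check_permission(source_url: str, user_dn: str) -> bool:
    """Stub: ARC re-checks the user's authorisation against the original
    source (and caches the result); here we simply allow everything."""
    return True


def fetch_into_cache(source_url: str) -> str:
    """Download the file into the cache unless it is already there."""
    path = cache_path(source_url)
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        urllib.request.urlretrieve(source_url, path)   # real ARC uses grid protocols
    return path


def prepare_input(source_url: str, user_dn: str, job_dir: str, name: str,
                  link: bool = True) -> str:
    """Make a cached input file visible in the job's working directory,
    either as a soft link or as a private copy."""
    if not check_permission(source_url, user_dn):
        raise PermissionError(source_url)
    cached = fetch_into_cache(source_url)
    target = os.path.join(job_dir, name)
    if link:
        os.symlink(cached, target)
    else:
        shutil.copy2(cached, target)
    return target
```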

Multiple Caches: data is always written locally

ARC CE Cache
- Counted as pledged storage but not accounted... this depends on the country (T2 in Sweden)
- Recommended size: 100 TB for a 2000-core site running ATLAS production/analysis
- Estimated ~1 PB of cache space across all of NorduGrid
- The cache filesystem must have very good performance: the ARC CE is writing while jobs are reading
- Can be scaled by adding more CE staging nodes and more caches

ARC Cache Index Service (ACIX)
- Caches periodically publish their content to a central index
- Bloom filters are used for efficiency, which leads to occasional false positives
- Very simple web service: an HTTP query returns a JSON dictionary of URL -> sites
- Jobs can be brokered to sites where their input files are already cached (see the sketch below)
- If there is a false positive, or the file has since been deleted, it doesn't matter: ARC can simply download the file again
- No need for enforced consistency
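The following sketch illustrates the cache-index idea; it is not the real ACIX implementation or its wire protocol. The Bloom filter parameters, the in-memory index and the function names (publish, lookup) are illustrative assumptions; in reality sites upload their filters over HTTP and the index is queried with an HTTP request returning JSON.

```python
"""Sketch of the cache-index idea: sites publish Bloom filters of their cached
URLs, the index answers "which sites (probably) have this URL?". Not the real
ACIX implementation; names and parameters are illustrative."""
import hashlib


class BloomFilter:
    """Tiny Bloom filter: membership tests may give false positives,
    never false negatives, which is exactly the trade-off accepted here."""

    def __init__(self, size_bits: int = 8192, hashes: int = 4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        for i in range(self.hashes):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


# The "index": per-site Bloom filters, refreshed periodically by each cache.
index: dict[str, BloomFilter] = {}


def publish(site: str, cached_urls: list[str]) -> None:
    bf = BloomFilter()
    for url in cached_urls:
        bf.add(url)
    index[site] = bf


def lookup(urls: list[str]) -> dict[str, list[str]]:
    """What the HTTP query would return: a JSON-able dict of url -> candidate sites.
    A listed site may be a false positive; the broker treats it only as a hint."""
    return {url: [site for site, bf in index.items() if url in bf] for url in urls}


if __name__ == "__main__":
    publish("SiteA", ["srm://example.org/atlas/file1", "srm://example.org/atlas/file2"])
    publish("SiteB", ["srm://example.org/atlas/file2"])
    print(lookup(["srm://example.org/atlas/file2", "srm://example.org/atlas/file3"]))
```

The Bloom filter trade-off is visible here: a lookup may name a site that does not actually hold the file, but since ARC falls back to downloading from the original source, a false positive costs only a cache miss.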

NDGF

Advantages
The combination of distributed, persistent, managed storage and caching gives many advantages:
- Pool downtime does not have to block jobs
- No administration is required for the caches
- No consistency requirement
- Automatic on-demand replication of popular data
- No replication of unused data
- Reduced load on the managed storage
- The managed storage does not need to be fast (for direct random data reading)

What next?
- Read from dCache via HTTPS instead of GridFTP (see the hedged sketch below)
  - Solves network problems with multi-homed machines and the OPN/public network
  - Writing of large files via HTTP is still problematic
- Access data in the ARC caches of other sites
  - Back-up replicas if the main storage is down
  - But we don't want another SE
  - Options: an xrootd federation of caches with loose consistency (security?), or our own federation using ACIX + the ARC CE HTTP interface
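As a rough illustration of the first point, the sketch below reads a file from a dCache WebDAV/HTTPS door using an X.509 proxy. The host, path and credential locations are hypothetical assumptions, and whether a plain HTTPS client accepts a given proxy chain depends on the server configuration; in practice one would typically use davix, curl or the ARC data tools.

```python
"""Sketch of reading a file over HTTPS from a dCache WebDAV door instead of
GridFTP. Host, path and credential locations are hypothetical examples."""
import os
import requests   # third-party: pip install requests

# X.509 proxy (certificate + key in one file) and the grid CA directory.
PROXY = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())
CA_DIR = "/etc/grid-security/certificates"

# Hypothetical WebDAV door URL; not a real NDGF endpoint.
URL = "https://dav.example.org:2880/atlas/disk/data12/somefile.root"


def https_read(url: str, dest: str) -> None:
    """Stream the remote file to a local path, authenticating with the proxy."""
    with requests.get(url, cert=PROXY, verify=CA_DIR, stream=True, timeout=300) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)


if __name__ == "__main__":
    https_read(URL, "somefile.root")
```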

Lessons Learned
- NDGF staff are core developers in dCache and ARC
  - Rapid availability of required new features
  - Influence over the development strategy
  - And they are involved in the user communities (ATLAS, ALICE, etc.)
- Automatic internal dCache replication and caching saves us during pool downtimes
- Distributed coordination takes a lot of close communication and learning
  - Some people are more experienced than others
  - Automatic, well-documented procedures for everything
- Users change their requirements all the time, and they like control
- Hardly any traditional middleware is used these days for job management; will data/storage management follow?
- Availability != happiness
- It is impossible to make a general system that suits everyone, even if that's not what funding agencies want to hear...
*Disclaimer: the presenter is funded by EMI, which funds ARC and dCache