Managed Data Storage and Data Access Services for Data Grids

Managed Data Storage and Data Access Services for Data Grids
M. Ernst, P. Fuhrmann, T. Mkrtchyan (DESY)
J. Bakken, I. Fisk, T. Perelmutov, D. Petravick (Fermilab)

As defined by the GriPhyN Project: global scientific communities, served by networks with bandwidths varying by orders of magnitude, need to perform computationally demanding analyses of geographically distributed datasets that will grow by at least 3 orders of magnitude over the next decade, from the 100 Terabyte to the 100 Petabyte scale. The goal is to provide a new degree of transparency in how data is accessed and processed.

- Data is acquired at a small number of facilities
- Data is accessed and processed at many locations
- The processing of data and data transfers can be costly
- The scientific community needs to access both raw data and processed data in an efficient and well managed way, on a national and international scale

- Discover the best replicas
- Harness a potentially large number of data, storage and network resources located in distinct administrative domains
- Respect local and global policies governing usage
- Schedule resources efficiently, again subject to local and global constraints
- Achieve high performance, with respect to both speed and reliability

Three major components: Storage Resource Management
- Data is stored on Disk Pool Servers or Mass Storage Systems
- Storage Resource Management needs to take into account:
  - Transparent access to files (migration from/to disk pool)
  - File pinning
  - Space reservation
  - File status notification
  - Lifetime management
- The Storage Resource Manager (SRM) takes care of all these details
- SRM is a Grid Service that takes care of local storage interaction and provides a Grid interface to off-site resources
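
To make these SRM responsibilities concrete, here is a minimal, illustrative Python sketch of the bookkeeping an SRM-like service performs (pinning, space reservation, lifetimes). The class and method names are invented for this example and are not part of any real SRM implementation.

import time

class ToySRM:
    """Illustrative model of SRM-style bookkeeping (not a real SRM API)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes   # total managed space
        self.reserved = 0                # bytes promised to clients
        self.pins = {}                   # path -> pin expiry timestamp

    def reserve_space(self, nbytes):
        # Space reservation: promise room before a new file is registered.
        if self.reserved + nbytes > self.capacity:
            raise RuntimeError("not enough free space to reserve")
        self.reserved += nbytes

    def pin(self, path, lifetime_s):
        # File pinning: keep the file on disk (not evicted to tape) until expiry.
        self.pins[path] = time.time() + lifetime_s

    def status(self, path):
        # File status notification: is the file pinned and for how long?
        expiry = self.pins.get(path)
        if expiry is None or expiry < time.time():
            return "NOT_PINNED"
        return "PINNED for %.0f more seconds" % (expiry - time.time())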

Three major components: Storage Resource Management (cont'd)
- Support for local policy
  - Each storage resource can be managed independently
  - Internal priorities are not sacrificed by data movements between Grid agents
- Disk and tape resources are presented as a single element
- Temporary locking / pinning
  - Files can be read from disk caches rather than from tape
- Reservation on demand and advance reservation
  - Space can be reserved for registering a new file
  - Plan the storage system usage
- File status and estimates for planning
  - Provides information on file status

Three major components: Storage Resource Management (cont'd)
- SRM provides a consistent interface to Mass Storage regardless of where data is stored (secondary and/or tertiary storage)
- Advantages
  - Adds resiliency to low-level file transfer services (i.e. FTP): restarts transfers if hung, checksums
  - Traffic shaping (to avoid oversubscription of servers and networks)
  - Credential delegation in 3rd-party transfers
  - Over plain POSIX access: file pinning, caching, reservation
- Current limitations
  - The standard does not include access to objects within a file
  - POSIX file system semantics (e.g. seek, read, write) are not supported
  - Need to use an additional file I/O library to access files in the storage system (details on GFAL by Jean-Philippe in this session at 3:40 PM)
- More on SRM and SRM-based Grid SEs
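
Because SRM itself offers no POSIX semantics, an application either copies the whole file locally first or reads it through a grid I/O library such as GFAL. The sketch below only illustrates this difference; local_replica_of and grid_open are placeholders, not the real GFAL calls.

def read_first_bytes_via_copy(local_replica_of, surl, nbytes=1024):
    # Without POSIX semantics through SRM, the application must first
    # materialize a full local copy (e.g. via an srmcp-like transfer),
    # even if it only needs a small part of the file.
    local_path = local_replica_of(surl)          # placeholder: full transfer
    with open(local_path, "rb") as f:
        return f.read(nbytes)

def read_first_bytes_via_io_library(grid_open, surl, nbytes=1024):
    # With a grid file I/O library (GFAL-like), the application can seek and
    # read remotely; 'grid_open' is a placeholder for such a library call.
    with grid_open(surl, "rb") as f:
        return f.read(nbytes)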

Three major components: Data Transport and Access, GridFTP
- Built on top of FTP
- Integrated with the Grid Security Infrastructure (GSI)
- Allows for 3rd-party control and data transfer
- Parallel data transfer (via multiple TCP streams)
- Striped data transfer: support for data striped or interleaved across multiple servers
- Partial file transfer
- Restartable data transfer
Replica Management Service
- Simple scheme for managing multiple copies of files
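
As a rough illustration of the parallel-stream feature, the sketch below drives a transfer through the globus-url-copy client from Python. The host names and paths are made up, and the exact flags should be checked against the installed client version.

import subprocess

def gridftp_copy(src_url, dst_url, streams=4):
    # -p <n> requests n parallel TCP streams (verify against your client's docs).
    # Example URLs (hypothetical hosts):
    #   gsiftp://se.site-a.example.org/data/file1
    #   gsiftp://se.site-b.example.org/data/file1   (3rd-party transfer)
    cmd = ["globus-url-copy", "-p", str(streams), src_url, dst_url]
    subprocess.run(cmd, check=True)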

(Diagram) Replica selection architecture: an application / data management system resolves a logical collection and logical file name plus attribute specification through a Metadata Catalog, looks up the multiple physical locations in a Replica Catalog, and a Replica Selection service uses performance information and predictions to pick the selected replica; SRM commands then act on the underlying storage (disk caches, disk arrays, tape library).
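
A minimal sketch of the replica-selection step shown in the diagram, assuming each candidate replica comes with a predicted transfer rate. The data structure and scoring rule are illustrative, not a real replica catalog interface.

def select_replica(replicas):
    # replicas: list of dicts like
    #   {"surl": "srm://se1.example.org/data/f1", "predicted_mb_per_s": 40.0}
    # Pick the replica with the best predicted transfer performance.
    return max(replicas, key=lambda r: r["predicted_mb_per_s"])

best = select_replica([
    {"surl": "srm://se1.example.org/data/f1", "predicted_mb_per_s": 40.0},
    {"surl": "srm://se2.example.org/data/f1", "predicted_mb_per_s": 12.5},
])
print(best["surl"])  # -> srm://se1.example.org/data/f1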

- A facility provider should not have to rely upon the application to vacate storage space
- The current architecture has bottlenecks associated with I/O to the servers
- Difficult for facility providers to enforce and publish storage usage policies using scripts and information providers
- Difficult for facilities to satisfy obligations to VOs without storage management and auditing
- Difficult for users to run reliably if they cannot ensure there is a place to write out the results
- Becomes more important as applications with large input requirements are attempted

- Basic management functionality is needed on the cluster regardless of how much storage is there
  - A large NFS-mounted disk area still needs to be cleaned up, and an application needs to be able to notify the facility how long it needs to have files stored
  - Techniques for transient storage management are needed
- SRM + dCache provides most of the functionality described earlier
  - The storage element is the equivalent of the processing queue and makes equivalent improvements
  - The storage element has some very advanced features

SRM/dCache
- Jointly developed by DESY and Fermilab
- Manages the storage
  - Local disks or arrays are combined into a common filesystem
  - POSIX-compliant interface (LD_PRELOAD library or access library compiled into the application)
  - Handles load balancing and system failure and recovery
  - The application waits patiently while a file is staged from the MSS (if applicable)
- Provides a common interface to physical storage systems
  - Normalizes interfaces and hides detailed implementation
  - Allows migration of technology
- Provides the functionality for storage management
  - Supervises and manages transfers
  - Circumvents the GridFTP scalability problem (SRM-initiated transfers only)

(Diagram) dCache-based Storage Element (LCG) architecture: clients reach the system through GFAL, a dcap client, or an FTP server (GSI, Kerberos); behind these doors sit the Storage Resource Manager, a GRIS information provider, the Resilient Manager, the dcap server, the PNFS name space, the dCache core and an HSM adapter, in front of the cache systems.

Components involved in Data Storage and Data Access
- Door
  - Provides a specific end point for client connections
  - Exists as long as the client process is alive
  - Acts as the client's proxy within dCache
- Name Space Provider
  - Interface to a file system name space
  - Maps dCache name space operations to filesystem operations
  - Stores extended file metadata
- Pool Manager
  - Performs pool selection
- Pool
  - Data repository handler
  - Launches the requested data transfer protocol
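
A toy sketch of the pool-selection decision the Pool Manager makes, assuming pools are scored by free space and current load. The scoring rule is invented for illustration and is not the actual dCache cost model.

def select_pool(pools, file_size):
    # pools: list of dicts like
    #   {"name": "pool1", "free_bytes": 2e12, "active_movers": 3}
    candidates = [p for p in pools if p["free_bytes"] >= file_size]
    if not candidates:
        raise RuntimeError("no pool has enough free space")
    # Prefer lightly loaded pools, break ties by free space.
    return min(candidates,
               key=lambda p: (p["active_movers"], -p["free_bytes"]))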

(Diagram) CMS DC04 data challenge (Tier-0 challenge, data distribution challenge, calibration challenge, analysis challenge): a fake DAQ at CERN (25 Hz, 2 MB/evt, 50 MByte/s, ~4 TByte/day) feeds a ~40 TByte CERN disk pool (~10 days of data); first-pass reconstruction of the 25 Hz, 1 MB/evt raw stream produces 25 Hz of 0.5 MB reco DSTs, written to disk cache and archive storage; event streams and TAG/AOD (10-100 kB/evt) are replicated to Tier-1 and Tier-2 sites together with conditions DB replicas, feeding calibration jobs and a Higgs background study that requests new events from the event server.
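
As a quick sanity check on the numbers in the diagram, the nominal bandwidth and daily volume follow directly from the event rate and event size:

rate_hz = 25            # events per second
event_size_mb = 2.0     # MB per event at the DAQ output
bandwidth = rate_hz * event_size_mb          # 50 MB/s
daily_volume_tb = bandwidth * 86400 / 1e6    # ~4.3 TB/day
print(bandwidth, daily_volume_tb)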

(Diagram) Tier-0 export chain: a configuration agent and a new-file discovery agent assign files to Tier-1 sites via the Transfer Management DB and the POOL RLS catalog; an EB agent copies digi and reco files from the input buffer into a general distribution buffer (copy via RM/SRM/SRB), from which the Tier-1 sites pull the data with their own tools (SRM, RM, SRB, dCache, LCG); clean-up agents purge transferred files and add/delete PFNs in the catalog.

(Diagram) DC04 transfer setup: a dCache instance at CERN and dCache/Enstore at FNAL, with disk buffers of 1 TB and 2.5 TB at each site, linked by an SRM control connection over a 622 Mbps path CERN - StarLight - ESnet - FNAL (03-04/2004).

Performing: copy srm://serverb/file1 srm://servera/file1
- The application/client asks Server A's SRM to perform the copy
- Server A's SRM issues "Get srm://serverb/file1" to Server B's SRM
- Server B's SRM stages and pins /file1 on its disk node; when staging completes, it returns the TURL gsiftp://gridftpnode/file1
- Server A delegates the user credentials and performs the GridFTP transfer (Server B's GridFTP node starts a mover and sends the data to Server A's disk node)
- On "Transfer complete", Server A signals "Get done"; Server B unpins /file1 and the copy returns success
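
The same exchange can be sketched as pseudocode; the object methods below are invented to mirror the message flow and are not actual SRM client calls.

def srm_copy(client, srm_a, srm_b, source_surl, dest_surl):
    # The client asks the destination SRM (server A) to pull the file;
    # server A in turn issues a 'get' to the source SRM (server B).
    turl = srm_b.get(source_surl)        # B stages/pins the file, returns a TURL
    # Server A delegates the user's credentials and runs the GridFTP transfer
    # from B's GridFTP node into its own disk pool.
    srm_a.gridftp_pull(turl, dest_surl, credentials=client.delegate())
    # After 'transfer complete', A reports 'get done' and B releases the pin.
    srm_b.release(source_surl)
    return "SUCCESS"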

- Total data transferred to FNAL: 5.2 TB (5293 GB)
- Total number of files transferred: 440K
- Best transfer day in number of files: 18560 (most of the files were transferred in the first 12 hours, then we were waiting for files to arrive at the EB)
- Best transfer day in size of data: 320 GB
- The average file size was very small: min 20.8 KB, max 1607.8 MB, mean 13.2 MB, median 581.6 KB

(Plot) Daily data transferred to FNAL: number of transferred files per day, 1 March 2004 to 26 April 2004 (scale up to 20000 files/day).

- We used multiple streams (GridFTP) with multiple files per SRM copy command: 15 srmcp (gets) in parallel and 30 files in one copy job, for a total of 450 files per transfer
- This reduced the overhead of authentication and increased the parallel transfer performance
- SRM file transfer processes can survive network failures and hardware component failures without any problem
- Automatic file migration from disk buffer to tape
- We believe that with the shown SRM/dCache setup 30K ...
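
A rough sketch of the batching scheme described above, assuming a hypothetical run_srmcp helper that wraps one srmcp copy job; the 15 x 30 grouping reproduces the 450 files per transfer round quoted in the slide.

from concurrent.futures import ThreadPoolExecutor

FILES_PER_JOB = 30      # files bundled into one srmcp copy job
PARALLEL_JOBS = 15      # srmcp "get" processes running at the same time

def chunks(items, n):
    for i in range(0, len(items), n):
        yield items[i:i + n]

def transfer_round(file_pairs, run_srmcp):
    # file_pairs: list of (source_surl, destination_surl) tuples.
    # run_srmcp: hypothetical helper that executes one copy job,
    #            e.g. by invoking the srmcp client on its batch of files.
    jobs = list(chunks(file_pairs, FILES_PER_JOB))[:PARALLEL_JOBS]  # <= 450 files
    with ThreadPoolExecutor(max_workers=PARALLEL_JOBS) as pool:
        list(pool.map(run_srmcp, jobs))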

- srmcp batches: the transfer scheduler aborts all transfers if a single transfer fails (solved in the latest version)
- Client failure: supposed to retry the transfer in case of a pool failure, selecting a different pool (solved)
- Space reservation: a prototype was available for SC2003; needs to be integrated with SRM v1 (planned for Q4/2004)
- Information provider: need a tightly integrated information provider for optimization

- HEP jobs are data-intensive, so it is important to take data location into account
- Need to integrate scheduling for large-scale data-intensive problems in Grids
- Replication of data to reduce remote data access

- Design goal for current Grid development: a single generic Grid infrastructure
  - providing simple and transparent access to arbitrary resource types
  - supporting all kinds of applications
- This contains several challenges for Grid scheduling and (storage) resource management

- Current approach:
  - Resource discovery and load distribution to a remote resource
  - Usually a batch job scheduling model on the remote machine
- But what is actually required for Grid scheduling is:
  - Co-allocation and coordination of different resource allocations for a Grid job
  - Instantaneous ad-hoc allocation is not always suitable
- This complex task involves:
  - Cooperation between different resource providers
  - Interaction with local resource management systems
  - Support for reservation and service level agreements
  - Orchestration of coordinated resource allocations

Access cost depends on:
- Current load of the HSM system
- Number of available tape drives
- Performance characteristics of the tape drives
- Data location (cache, tape)
- Data compression rate

access_cost_storage = time_latency + time_transfer
time_latency = t_w + t_u + t_m + t_p + t_t + t_d
  (t_w: waiting for resources/free drive, t_u: unloading an idle tape, t_m: mounting the tape, t_p: positioning, t_t: transfer from tape to cache, t_d: disk cache latency)
time_transfer = size_file / transfer_rate_cache
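
A small sketch that evaluates the cost formula above for one file; all timing values are made up for the example.

def access_cost(size_file_mb, transfer_rate_cache_mb_s, latencies_s):
    # latencies_s: dict holding the components of time_latency (seconds)
    time_latency = sum(latencies_s.values())
    time_transfer = size_file_mb / transfer_rate_cache_mb_s
    return time_latency + time_transfer

cost = access_cost(
    size_file_mb=2000.0,              # 2 GB file
    transfer_rate_cache_mb_s=60.0,    # disk cache read rate
    latencies_s={"wait": 30, "unload": 20, "mount": 40,
                 "position": 25, "tape_transfer": 70, "cache": 1},
)
print("estimated access cost: %.0f s" % cost)   # 186 s latency + ~33 s transfer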

(Diagram) Interaction of a Scheduling Service with the other Grid services: it queries the Information Service for static and scheduled/forecasted resource state, and works with a Data Management Service, a Network Management Service and the compute, data and network managers of the local management systems to obtain resource reservations; the basic building blocks and requirements are still to be defined!

- Investigation, development and implementation of the algorithms required for the decision-making process
- Intelligent scheduler
  - Methods to pre-determine the behavior of a given resource, i.e. a Mass Storage Management System, by using statistical data from the past to allow for optimization of future decisions
- The current implementation requires the SE to act instantaneously on a request; alternatives allowing to optimize resource utilization include:
  - Provisioning (make data available at a given time)
  - Cost associated with making data available at a given time: a defined cost metric could be used to select the least expensive SE
  - The SE could provide information as to when would be the most optimal time to deliver the requested data
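
A minimal sketch of the cost-metric idea, assuming each SE can report an estimated cost (for instance the access_cost value above) for delivering a given file; the estimate_cost interface is invented for illustration.

def cheapest_se(storage_elements, surl):
    # storage_elements: objects exposing an estimate_cost(surl) method that
    # returns the predicted number of seconds until the file could be delivered.
    return min(storage_elements, key=lambda se: se.estimate_cost(surl))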

- The SRM/dCache-based Grid-enabled SE is ready to serve the HEP community
- It provides end-to-end fault tolerance, run-time adaptation, multilevel policy support, and reliable and efficient transfers
- Improve Information Systems and Grid schedulers to serve the specific needs of Data Grids (co-allocation and coordination)
More information:
- dCache: http://www.dcache.org
- SRM: http://sdm.lbl.gov