and the GridKa mass storage system Jos van Wezel / GridKa


Tape / TSM / staging server (overview diagram)

Outline: Introduction; Grid storage and storage middleware; dcache and TSS; TSS internals; Conclusion and further work

FZK/GridKa
The GridKa project at Research Center Karlsruhe:
- Project start in 2002
- Construction of a compute cluster for use in computing grids
- Current capacity: ~3000 cores, 1.5 PB disk, 2 PB tape
- Commenced as central compute service to the particle physics community in Germany
- Now serves several other Virtual Organizations (VOs)
- Main focus at the moment is data processing for the LHC (Large Hadron Collider)
  - LCG: LHC Computing Grid
  - WLCG: Worldwide LCG

Planning numbers (chart): planned Tape (TB), Disk (TB) and CPU (# cores) capacity for the years 2007-2010.

Planned transfer rates to and from tape
T0: CERN/data source; T1: GridKa/data reprocessing; T2: data analysis

VO      T0->T1 [MB/s]  T2->T1 [MB/s]  T1->T1 in [avg. MB/s]  T1->T2 [MB/s]  T1->T1 out [avg. MB/s]
ALICE        14.9           44.1              15.2                 17.8              16.0
ATLAS        88.2           34.1              93.3                 99.0             110.3
CMS          26.3            7.0              91.2                 63.0              48.0
LHCb          6.3            0.2              20.4                 11.4              18.1
SUM         135.6           85.4             220.1                191.2             192.4

Tape hardware
Libraries:
- GRAU XL (~600 slots/hour)
- IBM 3592 (~700 slots/hour)
Drives:
- LTO2: 8 installed
- LTO3: 24 installed
- 8 older drives (1 replaced)
I/O rate: 42 MB/s observed, mostly less (~25 MB/s); could be 80?

Tape SAN (diagram)

Outline: Introduction; Grid storage and storage middleware; dcache and TSS; TSS internals; Conclusion and further work

Accessing Grid data
- Network access via a uniform protocol: SRM
- SRM is software on top of the storage management system
- Connects grid storage islands through a grid Storage Service
- Network interface to storage
- You provide data and a combination of:
  - Access protocol: dcap, rfio, nfs
  - Retention policy: recovery time needed (custodial, replica, output)
  - Access latency: nearline, online, (offline) / tape, disk, shelf
- Location management takes care of duplication
- The storage system (dcache) does the rest for you

WLCG storage
- WLCG uses the Storage Resource Manager (SRM)
- From SRM the following storage classes are inferred for WLCG data management (see the sketch after this list):
  - T1D0: files moved to tape directly
  - T1D1: files migrated to tape but kept on disk as long as there is space or until the pin times out
  - T0D1: files on disk only
- Tape (or the mass storage system) is considered custodial storage, meaning the data is to be kept indefinitely. We do not delete data on tape.
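A minimal sketch of how these classes line up with the SRM retention policy and access latency mentioned on the previous slide; the enum and function names are illustrative, not part of SRM or dcache:

```c
/* Hypothetical sketch: mapping SRM retention policy and access latency
 * to the WLCG storage classes named above. Names are illustrative only. */
#include <stdio.h>

typedef enum { RETENTION_CUSTODIAL, RETENTION_REPLICA } retention_t;
typedef enum { LATENCY_NEARLINE, LATENCY_ONLINE } latency_t;

static const char *storage_class(retention_t r, latency_t l)
{
    if (r == RETENTION_CUSTODIAL && l == LATENCY_NEARLINE)
        return "T1D0";   /* tape copy; disk copy may be evicted      */
    if (r == RETENTION_CUSTODIAL && l == LATENCY_ONLINE)
        return "T1D1";   /* tape copy plus a pinned disk copy        */
    if (r == RETENTION_REPLICA && l == LATENCY_ONLINE)
        return "T0D1";   /* disk only, no tape copy                  */
    return "unsupported combination";
}

int main(void)
{
    printf("custodial/nearline -> %s\n",
           storage_class(RETENTION_CUSTODIAL, LATENCY_NEARLINE));
    printf("replica/online     -> %s\n",
           storage_class(RETENTION_REPLICA, LATENCY_ONLINE));
    return 0;
}
```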

dcache
Think of it as a filesystem. It gives you:
A: an interface to the grid
  - via the SRM data storage interface
  - data placement based on SRM classes: disk, tape or both
B: managed disk storage
  - global name space (within the domain)
  - load balancing
  - access control (via certificates)
C: a backend to write to and read from permanent storage (tape or another Mass Storage System)
  - GridKa is custodian for part of the raw detector data
  - some computed data also goes to tape

Disk pool managers
- dcache: interfaced with TSS/TSM, HPSS, Enstore, OSM
- DPM: this disk pool manager has no mass/archival storage support yet
- StoRM: an SRM on top of GPFS; the efficiency of an interface to TSM is being investigated
- Xrootd: in use at particle physics labs (backends: HPSS, TSS/TSM)

dcache components (diagram): access doors for dcap, GridFTP and xrootd; file system metadata kept in a database; interface to the grid; user (CLI) commands; (p)nfs mount; the pool manager selects a pool for writes and supplies a pool for reads.

Data flow to the T1 with dcache and SRM (diagram): SRM and GridFTP clients on other grids open a channel through the OPN/firewall to the GridKa SRM server/door (SRM SOAP messages) and GridFTP server/door (GridFTP control and data channels); data lands on a dcache pool, from where TSS queues it (on space token description) towards the TSM tape classes 1..n.

Summary 1
- SRM is the entry point for data exchange between grid sites
- Disk pool managers offer an SRM interface to disk (and tape) storage
- The disk pool manager in use at GridKa is dcache
- dcache intelligently places files on distributed disk storage and coupled mass storage (tape)

Outline: Introduction; Grid storage and storage middleware; dcache and TSS; TSS internals; Conclusion and further work

dcache disk pools and pool nodes
- Disk pools on dcache trigger a callout based on:
  - number of files
  - total size
  - wait time
- The callout runs on recalls and on migrate requests, synchronous to dcache activities

Calling sequence
- dcache provides:
  - the physical filename: the name on the disk pool
  - a unique ID: the pnfsid
  - storage info: detailed variables
  - the logical file name: the name as seen by the user
  - an administrator-defined tag
- The tag is set per directory and follows the parent, e.g. under /pnfs/gridka/vos/ the atlas/disk, atlas/tape, cms/disk and cms/tape directories carry the tags dc_atlas and dc_cms (see the sketch below)
- The callout runs a UNIX script
- Output on callout
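A minimal sketch of the tag lookup, assuming a directory layout under /pnfs/gridka/vos/ (the slide truncates the exact paths); only the tag names dc_atlas and dc_cms come from the slide, everything else is hypothetical:

```c
/* Hypothetical sketch: resolve the administrator-defined directory tag
 * for a logical file name. In dcache the tag is stored per directory and
 * inherited from the parent; the prefixes below are assumed examples. */
#include <stdio.h>
#include <string.h>

struct tag_entry { const char *dir_prefix; const char *tag; };

static const struct tag_entry TAGS[] = {
    { "/pnfs/gridka/vos/atlas/", "dc_atlas" },  /* assumed prefix */
    { "/pnfs/gridka/vos/cms/",   "dc_cms"   },  /* assumed prefix */
};

static const char *tag_for(const char *logical_name)
{
    for (size_t i = 0; i < sizeof TAGS / sizeof TAGS[0]; i++)
        if (strncmp(logical_name, TAGS[i].dir_prefix,
                    strlen(TAGS[i].dir_prefix)) == 0)
            return TAGS[i].tag;
    return NULL;                 /* no tag: TSS refuses to migrate (see later) */
}

int main(void)
{
    const char *f = "/pnfs/gridka/vos/atlas/tape/example.file";  /* example */
    const char *tag = tag_for(f);
    printf("%s -> %s\n", f, tag ? tag : "(unknown class)");
    return 0;
}
```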

dcache to TSS / TSM (diagram)

Outline: Introduction; Grid storage and storage middleware; dcache and TSS; TSS internals; Conclusion and further work

dcache to TSM previously (original TSM backend)
- 1 file results in 1 store or recall: large overhead
- Session startup takes an inordinate amount of time when storage agents are used: the TSM volume selection algorithm starts a cartridge juggle; efficiency nears zero
- No data classes: everything goes to one and the same tape; no policies or quota for particular data
- On recalls there is no control over tape file order (recalls become virtually impossible); dcache cannot provide queues (for recalls)
- Remember: tape allows only sequential access!

Requirements for the dcache to tape interface
- Use the available TSM base at Forschungszentrum Karlsruhe
- Improve throughput
- Reduce the number of tape mounts
- Use different tape sets for different data classes

TSS properties
- Interfaces directly with TSM via the API
- Fan-out for all dpm/dcache to tape activities
- Multiple operations: recall, migrate, rename, delete, query
- Runs on the TSM clients, a storage agent or on the server proper
- Plug-in replacement for the TSM backend that comes with dcache
- Sends different types of data to different tape sets
- Two-level data classes (with dcache)
- Queues requests in tape sequence order (see the sketch after this list)
- No persistent state is kept
- Allows storing an exact image of the logical global name space on tape
- Command line interface to set running parameters, monitor the processing and run DB queries (think of it as an alternative dsmc)
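A minimal sketch of the "queue requests in tape sequence order" idea; the struct layout and volume label format are assumptions, not TSS internals:

```c
/* Hypothetical sketch: recall requests are grouped per tape volume and
 * sorted by their file position on that volume, so each cartridge is
 * mounted once and read front to back. All names are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct recall_req {
    char     volume[8];    /* tape volume label (format assumed)          */
    unsigned file_pos;     /* position of the object on that volume       */
    char     pnfsid[16];   /* dcache unique id (shortened for the sketch) */
};

static int by_tape_order(const void *a, const void *b)
{
    const struct recall_req *x = a, *y = b;
    int v = strcmp(x->volume, y->volume);          /* group by volume     */
    if (v != 0)
        return v;
    return (x->file_pos > y->file_pos) - (x->file_pos < y->file_pos);
}

int main(void)
{
    struct recall_req q[] = {
        { "A00002", 40, "0002" },
        { "A00001", 90, "0003" },
        { "A00001", 10, "0001" },
    };
    size_t n = sizeof q / sizeof q[0];

    qsort(q, n, sizeof q[0], by_tape_order);       /* order before mounting */

    for (size_t i = 0; i < n; i++)
        printf("%s pos %3u  pnfsid %s\n", q[i].volume, q[i].file_pos, q[i].pnfsid);
    return 0;
}
```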

TSS command and data flow (diagram): dcache requests enter the queuing subsystem (which holds archive and class metadata from the tape system) and are placed on store/recall queues; the scheduler, communicating with the arbiter, processes a selected queue; payload data moves between the dcache pool and the TSM storage agent over the data channel, while the TSM API session channel carries archive metadata and inter-scheduler communication.

Major components
- Queuing engine
  - data management: input/output files, set data classes
  - enqueue: creates queues
- Scheduler
  - selects the queue to process based on a trigger
  - starts threads to process queue(s)
- TSM DMI interface
  - handles sessions
  - queries the TSM DB
  - sets up data transfers
  - sends and receives data
- Admin interface
  - separate thread to return status information of queues and clients
  - stopping and starting the subsystems
  - changing running parameters

TSS Scheduler
- The scheduler starts request processing per queue
- More than one queue may be processed concurrently (allows big hosts to handle 2 or more tape drives)
- A queue is determined runnable based on (see the sketch after this list):
  - time: elapsed time since the first job entry
  - size: summation of the number of bytes of all files
  - length: number of requests in the queue
- Communicates with the arbiter to prevent drive collisions (in the next version)
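A minimal sketch of the runnable check, assuming per-queue thresholds for the three triggers; the names and example values are not the actual TSS parameters:

```c
/* Hypothetical sketch of the scheduler's runnable check, using the three
 * triggers named on the slide: elapsed time since the first entry, total
 * queued bytes and queue length. Thresholds and names are assumed. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct tss_queue {
    size_t   length;        /* number of requests in the queue      */
    uint64_t total_bytes;   /* summed size of all queued files      */
    time_t   first_entry;   /* arrival time of the first job entry  */
};

struct triggers {           /* per-queue scheduling parameters (assumed) */
    size_t   max_length;
    uint64_t max_bytes;
    double   max_wait_s;
};

static bool queue_runnable(const struct tss_queue *q,
                           const struct triggers *t, time_t now)
{
    if (q->length == 0)
        return false;                               /* nothing to do   */
    if (q->length >= t->max_length)
        return true;                                /* length trigger  */
    if (q->total_bytes >= t->max_bytes)
        return true;                                /* size trigger    */
    if (difftime(now, q->first_entry) >= t->max_wait_s)
        return true;                                /* time trigger    */
    return false;
}

int main(void)
{
    struct triggers t = { 500, 100ULL << 30, 1800.0 };      /* assumed values */
    struct tss_queue q = { 12, 3ULL << 30, time(NULL) - 2400 };
    printf("runnable: %s\n", queue_runnable(&q, &t, time(NULL)) ? "yes" : "no");
    return 0;
}
```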

TSS Queue engine
- On a (recall) entry the TSM DB is queried; if the object exists, its id is put in the queue
- On a (migrate) entry the management class is selected; unknown classes are not migrated
- Renames, deletes etc. are forwarded directly
- The caller waits until TSS returns with the data or a non-zero error code (see the sketch below)
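A minimal sketch of this dispatch, with stubbed helpers standing in for the TSM/DMI calls and the class table; all names and signatures are assumptions:

```c
/* Hypothetical sketch of the queue engine's dispatch step. The stub
 * helpers stand in for the TSM DB query, the class table, the queues
 * and the direct forwarding path. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum req_type { REQ_RECALL, REQ_MIGRATE, REQ_RENAME, REQ_DELETE };

struct request {
    enum req_type type;
    const char   *pnfsid;  /* dcache unique id                      */
    const char   *mclass;  /* data/management class (directory tag) */
};

/* Stubs for illustration only. */
static bool tsm_object_exists(const char *pnfsid) { (void)pnfsid; return true; }
static bool class_is_known(const char *mclass)    { return strcmp(mclass, "dc_atlas") == 0; }
static int  enqueue(const struct request *r)      { printf("queued %s\n", r->pnfsid); return 0; }
static int  forward(const struct request *r)      { printf("forwarded %s\n", r->pnfsid); return 0; }

/* The caller blocks until this returns 0 (done) or a non-zero error code. */
static int handle_request(const struct request *r)
{
    switch (r->type) {
    case REQ_RECALL:
        if (!tsm_object_exists(r->pnfsid))   /* query the TSM DB first          */
            return 1;
        return enqueue(r);                   /* the id goes on the recall queue */
    case REQ_MIGRATE:
        if (!class_is_known(r->mclass))      /* unknown classes are not migrated */
            return 1;
        return enqueue(r);
    case REQ_RENAME:
    case REQ_DELETE:
        return forward(r);                   /* not queued, executed directly   */
    }
    return 1;
}

int main(void)
{
    struct request a = { REQ_MIGRATE, "00001111", "dc_atlas" };
    struct request b = { REQ_MIGRATE, "00002222", "dc_bogus" };
    printf("a: %d, b: %d\n", handle_request(&a), handle_request(&b));
    return 0;
}
```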

DMI API
- A wrapper around the TSM API: a library on top of the library
- Keeps track of open sessions/handles
- Simplifies queries and Send/Get data calls
- Utility functions: dmi_query_mc(), dmi_log(), dmi_session_info() etc.
- Callbacks separate the API library from the rest of the code
- Example: a data Get becomes (expanded in the sketch below):
  if (dmi_init(p1, ...) == 0)
      if (dmi_query(p1, ...) == 0)
          if (dmi_get(p1, ...) == 0)
              return 0;
- Regretfully there is no API support for library and/or volume handling.
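Expanding the slide's fragment into a compilable sketch with stubbed dmi_* calls; the real wrapper sits on top of the TSM API, and the signatures shown here are assumptions:

```c
/* Hypothetical expansion of the dmi_* example above: the wrapper keeps
 * the session handle and error reporting in one place so a data Get is
 * a short chain of calls. Signatures and structs are assumed. */
#include <stdio.h>

struct dmi_session { int handle; };        /* tracks the open TSM session */
struct dmi_object  { const char *pnfsid; unsigned long id; };

/* Stubbed wrapper calls for illustration. */
static int dmi_init(struct dmi_session *s)                        { s->handle = 1; return 0; }
static int dmi_query(struct dmi_session *s, struct dmi_object *o) { (void)s; o->id = 42; return 0; }
static int dmi_get(struct dmi_session *s, const struct dmi_object *o,
                   const char *local_path)
{
    printf("session %d: get object %lu -> %s\n", s->handle, o->id, local_path);
    return 0;
}
static void dmi_log(const char *msg) { fprintf(stderr, "dmi: %s\n", msg); }

/* A data Get as in the slide's example: init, query, then get. */
static int get_object(const char *pnfsid, const char *local_path)
{
    struct dmi_session s;
    struct dmi_object  o = { pnfsid, 0 };

    if (dmi_init(&s) == 0)
        if (dmi_query(&s, &o) == 0)
            if (dmi_get(&s, &o, local_path) == 0)
                return 0;

    dmi_log("get failed");
    return 1;
}

int main(void)
{
    return get_object("000A1B2C", "/tmp/restored.file");
}
```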

Tapeview (VOs) (screenshot)

Tapeview (storage agents) (screenshot)

Outline: Introduction; Grid storage and storage middleware; dcache and TSS; TSS internals; Conclusion and further work

Current issues
- "Communication lost" errors: ANS1026E (RC136) The session is rejected: There was a communications protocol error.
- Multiple TSM clients talking to a single TSM storage agent allocate multiple tape drives (multiple clients now talk to a single STA)
- No load balancing or diversion for more than 1 library
- No upstream error detection: library down, no scratch tapes left, no more drives available, etc.
- The dcache (Java) to TSS interface is a shell script, i.e. limited signal processing.

In progress
- Queue process arbitration via a volume pegboard (sketched below)
  - reduces concurrent drive access
  - improves recall throughput
  - synchronous updates would slow operations down
- Concurrent queue processing: a configurable number of queues processed concurrently on a single host
- Per-queue scheduling parameters: different queue triggers for read or write queues, finer tuning of write queues
- Remote queue entry: clients connect to a central TSS, which groups requests (especially needed for recalls)
- Support for multiple tape libraries
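Since this is planned work, the following is only a minimal sketch of what such a pegboard could look like: a shared table of volume labels in use, protected by a mutex, that a queue processor consults before dispatching. All names are illustrative.

```c
/* Hypothetical sketch of a volume pegboard: a queue "pegs" its target
 * volume before dispatch; a second queue needing the same volume is held
 * back, so two schedulers do not collide on one drive. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_PEGS 16

static struct {
    pthread_mutex_t lock;
    char            volumes[MAX_PEGS][8];   /* labels currently in use */
} pegboard = { PTHREAD_MUTEX_INITIALIZER };

/* Try to claim a volume; returns false if another queue already holds it. */
static bool peg_volume(const char *vol)
{
    bool claimed = false;
    int free_slot = -1;

    pthread_mutex_lock(&pegboard.lock);
    for (int i = 0; i < MAX_PEGS; i++) {
        if (pegboard.volumes[i][0] == '\0') {
            if (free_slot < 0) free_slot = i;
        } else if (strcmp(pegboard.volumes[i], vol) == 0) {
            free_slot = -1;                  /* already pegged elsewhere */
            break;
        }
    }
    if (free_slot >= 0) {
        strncpy(pegboard.volumes[free_slot], vol, sizeof pegboard.volumes[0] - 1);
        claimed = true;
    }
    pthread_mutex_unlock(&pegboard.lock);
    return claimed;
}

static void unpeg_volume(const char *vol)
{
    pthread_mutex_lock(&pegboard.lock);
    for (int i = 0; i < MAX_PEGS; i++)
        if (strcmp(pegboard.volumes[i], vol) == 0)
            pegboard.volumes[i][0] = '\0';
    pthread_mutex_unlock(&pegboard.lock);
}

int main(void)
{
    printf("claim A00001: %d\n", peg_volume("A00001"));   /* 1: free       */
    printf("claim A00001: %d\n", peg_volume("A00001"));   /* 0: collision  */
    unpeg_volume("A00001");
    printf("claim A00001: %d\n", peg_volume("A00001"));   /* 1: free again */
    return 0;
}
```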

Conclusions
- TSM can handle > 300 MB/s
- TSS is working as expected
- Tape speed is not at the expected rates
  - need to find out the access pattern / tape mounts
- Need better error recovery
- Configuration is, eeeeh, pretty complex
- It would be better if TSM could do this

Outline: Introduction; Grid storage and storage middleware; dcache and TSS; TSS internals; Conclusion and further work

Many thanks to: Dorin Lobontu, Stephanie Boehringer, Silke Halstenberg, Doris Ressmann. You probably have some questions?

Spare slides

Data flow: data and metadata (diagram)

Storage on the grid: SRM
- hides the complexity of the local storage at a site behind a uniform interface: SRM
- connects grid storage islands through a grid Storage Service
- SRM is software on top of the storage management system
- provides dynamic space allocation/reservation and file management: space management functions
- provides dynamic information regarding storage and files: status functions
- takes care of authorization and authentication (in the dcache SRM via the gplazma cell): permission functions
- transfer protocol negotiation: data transfer functions
- and many other things...