STORAGE @ TGCC - OVERVIEW
CEA, 10 April 2012
CONTEXT
Data-Centric Architecture
- Centralized storage, accessible from all of TGCC's compute machines
- Makes cross-platform data sharing possible
Mutualized Storage
- Optimal use of resources
High-Performance Systems
- Manage more than 10 PB of data
- Connected to the computers via Lustre routers (100 GB/s)
Hierarchical System
- HSM: Hierarchical Storage Manager
- LUSTRE filesystem for high-performance access
- Automated migration to tapes (HPSS)
SOFTWARE COMPONENTS
LUSTRE
Parallel Filesystem
- Developed by Intel Data Division (formerly WhamCloud), with support from international labs and organizations
- Open-source product; half of the Top500 machines use it
- CEA is part of the development
Components
- MGS (Management Server)
- MDS (Metadata Server)
- OSS (Object Storage Server)
- Routers
- Clients
GL-TGCC
- Two filesystems: work and store
- InfiniBand QDR interconnect
- 1x metadata cell: 2x MDS, 1x DDN SFA10K
- 10x I/O cells: 4x OSS, 1x DDN SFA10K
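From any client node, the standard Lustre lfs utility can show the servers behind these mount points. A minimal sketch, assuming a standard Lustre client and the /ccc/work mount point quoted later in this deck:

    # List the MDT and each OST backing the work filesystem, with per-target usage
    lfs df -h /ccc/work
    # List the OSTs (object storage targets) visible through this mount point
    lfs osts /ccc/work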
HPSS
High-Performance HSM
- Developed by IBM
- Third-party product, sold as a service
Components
- Core server
- Disk movers (disks)
- Tape movers (tapes)
ST-TGCC
- 1x core server
- 2x disk mover cells: 4x servers, 1x DDN SFA10K
- 2x tape mover cells: 3x servers, 4x LTO5 tape drives
STORAGE INSIDE TGCC
ARCHITECTURE
FILESYSTEMS
work: LUSTRE file system
- Dedicated to short-term data and to data sharing (core of the data-centric architecture)
- Quotas: 1 TB and 500k inodes per user
- Not connected to the HSM
store: LUSTRE file system
- Dedicated to long-term data
- Recommended file size: 1 GB-100 GB, up to 1 TB
- Quotas: 100k inodes per user (inode quota only, no quota on volume)
- Automated migration/staging of data to and from the HSM
LUSTRE
LUSTRE'S ARCHITECTURE: INTRODUCTION
Lustre: a parallel file system
- 1 Metadata Server (MDS) per defined filesystem
- Many Object Storage Servers (OSS)
- Can deal with thousands of clients
- Guarantees data and metadata coherency via LDLM (Lustre Distributed Lock Manager)
- Runs in kernel space to be closer to the hardware
- Open source (GPL)
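Since file data is spread over the OSSes, users can control striping themselves with the standard lfs commands. A minimal sketch (the path and stripe count are illustrative, not site recommendations):

    # Create a file striped over 4 OSTs so large writes hit several OSSes in parallel
    lfs setstripe -c 4 /ccc/work/mygroup/myuser/big_output.dat
    # Show how an existing file is striped
    lfs getstripe /ccc/work/mygroup/myuser/big_output.dat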
THE LUSTRE FILE SYSTEMS SERVED BY GL-TGCC
work
- Core of the data-centric architecture
- Quotas: 1 TB and 500k inodes per user
- «Standalone» file system
- Mount point: /ccc/work ($CCCWORKDIR)
- Designed for throughput and performance
store
- Should be used to store final results
- Connected to an HSM (see later slides) for bigger capacity
- Recommended file size: 1 GB-100 GB
- Quotas: 100k inodes per user, no quota on volume
- Automated migration and staging with the HSM (see later slides)
- Mount point: /ccc/store ($CCCSTOREDIR)
- Designed for data capacity
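Usage against these quotas can be checked with the standard Lustre quota query. A minimal sketch, using the mount points and environment variables quoted on this slide:

    # Per-user volume and inode usage on work (1 TB / 500k inode limits)
    lfs quota -u $USER /ccc/work
    # Per-user usage on store (inode quota only)
    lfs quota -u $USER /ccc/store
    # The same locations are available through environment variables
    echo $CCCWORKDIR $CCCSTOREDIR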
TGCC: STORAGE ARCHITECTURE
GL-TGCC ARCHITECTURE (DATA-CENTRIC)
GL-TGCC'S METADATA CELL
Pack of 2 MDS
- Lustre failover across the 2 MDS
- 1 MDS for work
- 1 MDS for store + MGS
- Crossed active/passive failover
Storage backend
- Metadata records stored in RAID-6 (8+2) with double parity
GL-TGCC LUSTRE I/O CELL
Pack of 4 OSS
- Lustre failover across the 4 OSS
Backend storage
- 28 VDs of 16 TB, extensible to 24 TB
- Total: 448 TB (with 16 TB VDs), up to 672 TB (with 24 TB VDs)
- 7 OSTs per OSS
Cell I/O performance
- 10 GB/s max (Lustre throughput)
LUSTRE'S ARCHITECTURE: NETWORK TOPOLOGY
GL-TGCC's InfiniBand network
HPSS
WHAT IS A HIERARCHICAL STORAGE MANAGER?
HPSS is a Hierarchical Storage Manager (HSM)
- Data «sediments» from disks to tapes, via an age-based policy
- New data is still on disks
- Old data has gone to tape
- This is not a backup or an archive: no disk/tape replica
[Diagram: /ccc/store data moving from the disk tier to the tape tier over time]
HPSS: A PARALLEL HSM
Main components:
- Core server + DB2 database
- Disk movers + disk arrays
- Tape movers + tape drives
Several separate servers make it possible to extend bandwidth via parallel streams
[Diagram: core server coordinating 4 disk movers and 2 tape movers]
HSM BINDING
DATA MIGRATION
Foundations of the HSM
- store is permanently watched by a policy engine (Robinhood)
- Files eligible for migration are automatically stored in HPSS
- The filesystem is saved in the HSM: recovery is possible in case of a crash, a major hardware failure, or the FS being reformatted
Older files are:
- Still visible in store with their original size
- Their contents are moved out of store and kept in HPSS
- This is fully transparent to the end user
- Space freed in store is available for new files
Freed files are staged back at first access
- Transparent to the end user
- The first I/O call is blocked until the stage operation is completed
A FILE'S LIFE
[State diagram: creation → new (online); copy into HPSS → archived/synchro; disk space freed → released (offline); stage operation → back online; modification → modified/dirty, then copied to HPSS again]
USER INTERFACE
User's view:
- The user accesses data via a standardized path: /ccc/store/contxxx/grp/usr ($STOREDIR)
- No direct access to HPSS; it is «hidden» behind store
- Regular commands apply to store
- Accessing a released file stages it back to LUSTRE; the I/O is blocked until the transfer is completed
ccc_hsm command:
- ccc_hsm status: query file status (online, released, ...)
- ccc_hsm get: prefetch files
- ccc_hsm ls: does «ls» but also shows the HSM status (online, offline)
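A typical session with these commands could look as follows; a minimal sketch where the file names are made up and only the ccc_hsm sub-commands come from this slide:

    # Check whether a result file is still online or has been released to tape
    ccc_hsm status $STOREDIR/results_2012.tar
    # Prefetch it before a post-processing job so the first read does not block
    ccc_hsm get $STOREDIR/results_2012.tar
    # List a directory with the HSM state (online/offline) of each entry
    ccc_hsm ls $STOREDIR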
FILES' POPULATION
AS A CONCLUSION
COMPUTE RESULTS: BIG IS BEAUTIFUL
From the HSM's point of view, small files suck
- As much as possible, users MUST avoid storing small files in the HSM
- Smaller files mean more files: huge flat directories are something nobody wants to deal with
- Waste of space in the HSM's DB containers
- Files will be spread over multiple tapes; each of them will require a tape mount, wasting a lot of time
- They pollute the caches
- They kill the advantages of «IO pipelines» by producing «bubbles»
IF YOU CAN, MAKE FILES AS BIG AS TAPES ARE (~10 GB-100 GB)
Big files are nice to you
- Accessing them results in a single tape mount
- Very good pipelining efficiency
- Allows efficient streams
- Makes it possible to engage parallel mechanisms
TAR IS YOUR FRIEND
TAR is dangerous only in cigarettes
- Using TAR is an easy (well OK, relatively easy ;-) ) way of packing files
- TAR has checksumming features to ensure data safety: protects you from silent data corruption
- Tools exist to access tarballs from software: tar files follow a well-known standard (see libarchive for example)
TAR preserves metadata
- Permissions
- Owners/groups
TAR preserves symlinks
I have an open mind: you can use cpio if you prefer ;-)
Thinking about a framework to perform I/O in your simulation code is never a bad idea.
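A minimal packing sketch along these lines, assuming the $CCCWORKDIR and $STOREDIR variables quoted earlier and made-up directory names:

    # Pack a whole run directory into one big tarball on store instead of many small files
    tar -cvf $STOREDIR/run_042.tar -C $CCCWORKDIR/myproject run_042
    # List its contents later without unpacking
    tar -tvf $STOREDIR/run_042.tar
    # Extract it back to work when the data is needed again
    tar -xvf $STOREDIR/run_042.tar -C $CCCWORKDIR/myproject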
KEEP IN MIND WHAT THE RESOURCES ARE MADE FOR
- STOREDIR = CAPACITY
- WORKDIR = SHARING & PERFORMANCE
- SCRATCH = LOCALITY & PERFORMANCE
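Put together, a run might flow through the three resources like this; a minimal sketch where the scratch variable name, directories, and binary are placeholders, and only the roles come from this slide:

    # Compute and write raw output where locality and speed matter ($SCRATCHDIR is a placeholder name)
    cd $SCRATCHDIR/run_042 && ./my_simulation
    # Copy data that must be shared or post-processed across machines to work
    cp -r $SCRATCHDIR/run_042 $CCCWORKDIR/myproject/
    # Archive the final results to store as one large tarball (the capacity tier)
    tar -cvf $STOREDIR/run_042_final.tar -C $CCCWORKDIR/myproject run_042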
ENJOY THE STORAGE