Administering Lustre 2.0 at CEA

Administering Lustre 2.0 at CEA
European Lustre Workshop 2011, September 26-27, 2011
Stéphane Thiell, CEA/DAM
stephane.thiell@cea.fr

Lustre 2.0 timeline at CEA
  2009/04  Lustre 2.0 Alpha-X  TERA+
  2010/04  Lustre 2.0 Beta-1   TERA100 demonstrator
  2010/08  Lustre 2.0          TERA100
  2011     Lustre 2.1          TGCC (Curie)
Early adoption of Lustre 2.0
  Benefit from 2.0 features in new production machines: FIDs (e.g. NFS-Ganesha FSAL/LUSTRE), Changelogs
  Ease future 2.x upgrades of production machines: Lustre HSM binding
  Push forward Lustre technology

Administering Lustre 2.0 at CEA - Outline
  Operations & Maintenance
    Shine
    Lessons learned
    Performance tunings
    Monitoring
  Features
  Resolved issues

Lustre 2.0 Operations & Maintenance
Lustre administration software ecosystem at CEA
  Base suites
    DM-Multipath (Linux native multipath I/O)
    Shine (Lustre FS management and tuning)
    Robinhood Policy Engine (powerful FS content manager)
    Puppet (centralized configuration management)
    Pacemaker (high availability)
  Monitoring
    Syslog-ng (versatile replacement for syslogd)
    SEC (Simple Event Correlator)
    Nagios (supervision)
    RRDTool (graphing engine)
  Debugging
    crash (dump or live system debugging)

Shine to manage Lustre 2.0 filesystems (1/5)
Shine: Python library and tools for Lustre
  Set up and manage Lustre filesystem components in parallel
  Open Source (GPL): http://lustre-shine.sf.net/
  Human-readable configuration file to describe each FS

# file: /etc/shine/models/simple1.lmf - cluster: demo
fs_name: simple1
description: Simple Lustre filesystem model file example
nid_map: nodes=demo[0-999] nids=demo[0-999]-ib0@o2ib0
mgt: node=demo110 ha_node=demo111 dev=/dev/mapper/mgt
mdt: node=demo113 ha_node=demo114 dev=/dev/mapper/s1_mdt
ost: node=demo200 ha_node=demo201 dev=/dev/mapper/s1_ost[0-3] index=[0-3]
ost: node=demo201 ha_node=demo200 dev=/dev/mapper/s1_ost[4-7] index=[4-7]
client: node=demo[50-55,400-759]
mount_path: /demo/simple1
stripe_size: 1048576
stripe_count: 2
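For context (not shown on this slide), a typical Shine sequence to turn such a model file into a running filesystem could look roughly like the following; the subcommand names match the ones used later in this deck, but the sequence itself is a sketch rather than CEA's exact procedure:

  # Distribute the filesystem configuration described by the model file
  # shine install -m /etc/shine/models/simple1.lmf
  # Format all targets in parallel, then start servers and mount clients
  # shine format -f simple1
  # shine start -f simple1
  # shine mount -f simple1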

Shine to manage Lustre 2.0 filesystems (2/5)

# file: /etc/shine/models/more2.lmf - cluster: demo
fs_name: more2
description: More advanced Lustre filesystem model file example
nid_map: nodes=demo[0-999] nids=demo[0-999]-ib0@o2ib0
# Externally managed MGS
mgt: node=demo110 ha_node=demo111 mode=external
mdt: node=demo113 ha_node=demo114 dev=/dev/mapper/m2_mdt
ost: node=demo200 ha_node=demo201 dev=/dev/mapper/m2_ost[0-3] index=[0-3]
ost: node=demo201 ha_node=demo200 dev=/dev/mapper/m2_ost[4-7] index=[4-7]
client: node=demo[50-55,400-759]
mount_path: /demo/more2
mount_options: acl,user_xattr,flock
mdt_mount_options: acl,user_xattr
stripe_size: 1048576
stripe_count: 2
mdt_mkfs_options: -O dir_index,dirdata,uninit_bg,mmp,flex_bg -G 256
ost_mkfs_options: -m2 -O dir_index,extents,uninit_bg,mmp,flex_bg -G 256
quota: yes

Shine to manage Lustre 2.0 filesystems (3/5)
Use case: removing an OST, made easy with Shine (v0.910+)

Purge OST #7 files (list and back up files beforehand if needed):
# robinhood --purge-ost=7,0 -i

Stop all filesystem components:
# shine umount -f demofs
# shine stop -f demofs

Comment out the OST #7 definition in /etc/shine/models/demofs.lmf:
  ost: node=demo201 dev=/dev/mapper/ost7 index=7
becomes:
  #ost: node=demo201 dev=/dev/mapper/ost7 index=7

Update the distributed shine FS config files from the new model:
# shine update -m /etc/shine/models/demofs.lmf

Run the needed tunefs.lustre in parallel (performs a writeconf in this case):
# shine tunefs -f demofs

Start the filesystem again:
# shine start -f demofs
# shine mount -f demofs

Shine to manage Lustre 2.0 filesystems (4/5)
Manage LNET routers easily with Shine
  One line in the config file (.lmf) to define the routers of a filesystem:
    ...
    router: node=demo[120-137]
    ...
  Centralized router management (start, stop, status):

# shine status -t router -f demofs
FILESYSTEM COMPONENTS STATUS (demofs)
+------+----+---------------+--------+
| type | #  | nodes         | status |
+------+----+---------------+--------+
| ROU  | 18 | demo[120-137] | online |
+------+----+---------------+--------+

Shine to manage Lustre 2.0 filesystems (5/5)
Shine: advanced features
  Multi-NID support: the nid_map rule can be repeated
  Optional static network definition per target

nid_map: nodes=demo[0-999] nids=demo[0-999]-ib0@o2ib0
nid_map: nodes=demo[0-999] nids=demo[0-999]-ib1@o2ib1
ost: node=demo200 dev=/dev/mapper/ost[0-8/2] index=[0-8/2] network=o2ib0
ost: node=demo201 dev=/dev/mapper/ost[1-9/2] index=[1-9/2] network=o2ib1

Lustre 2.0 Operations: lessons learned
Enable the flex_bg ext4 feature
  ost_mkfs_options: -O dir_index,extents,mmp,uninit_bg,flex_bg -G 256
  Enabled by default in Lustre 2.1 (LU-255)
  fsck time is no longer a problem for us (as of today)
Tune the backend filesystem according to the hardware
Router status coherency
  Strictly avoid any incoherent status seen from the two sides of a router
  Need to check lctl route_list output from both sides (see the sketch below)
  [Diagram on the original slide: an LNET router between o2ibx and o2iby reported UP on one side but DOWN on the other]
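As a minimal illustration (not taken from the slides), the route status on both sides of the routers can be collected in one pass with ClusterShell's clush, which is already part of the tool set above; the node sets demo[50-55] and demo[200-201] are placeholders for clients and servers on each side:

  # Gather the routing table from nodes on both sides and group identical
  # outputs (-b) so any up/down disagreement stands out immediately.
  # clush -b -w demo[50-55],demo[200-201] 'lctl route_list'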

Lustre 2.0 Operations: performance tunings
Lustre tunings (clients)
  osc.*.max_pages_per_rpc=256
  llite.*.max_cached_mb: feature lost as of Lustre 2.0 (LU-141)
Kernel tunings
  Increase vm.min_free_kbytes on clients: vm.min_free_kbytes=262144
  vm.zone_reclaim_mode (depending on hardware)
    vm.zone_reclaim_mode=0: page reclaim (and allocation) satisfied from all zones; stable (but not best) performance with Lustre
    vm.zone_reclaim_mode=1: page reclaim from remote zones is avoided; variable, long-term inconsistent performance; not good until CPU-affinity thread pools land in Lustre
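As a hedged sketch (not part of the original slides), these client tunings could be pushed cluster-wide with clush; the node set comes from the earlier model file and the values are the ones quoted above:

  # Lustre client tuning: allow 256 pages (1 MB with 4 KB pages) per bulk RPC
  # clush -w demo[50-55,400-759] 'lctl set_param osc.*.max_pages_per_rpc=256'
  # Kernel VM tuning: keep more free memory and disable NUMA zone reclaim
  # clush -w demo[50-55,400-759] 'sysctl -w vm.min_free_kbytes=262144 vm.zone_reclaim_mode=0'

To survive reboots, the sysctl values would also go into /etc/sysctl.conf, for instance deployed through Puppet, which is listed in the admin ecosystem above.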

Lustre 2.0 Monitoring (1/3)
Lustre activity monitoring
  Track user creativity with Robinhood alerts
  Home-made Lustre activity monitoring solution
    ClusterShell scripts to gather Lustre metrics: http://clustershell.sf.net/
    RRDTool graphs («Overlook»)
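The slide does not show the CEA scripts themselves; as a purely illustrative sketch of the approach, a small ClusterShell-based gatherer might look like the following (the OSS node set and the kbytesavail parameter are assumptions, and the resulting values would then be fed into an RRDTool database):

  #!/usr/bin/env python
  # Illustrative only: poll free space per OST from the OSS nodes in parallel.
  from ClusterShell.NodeSet import NodeSet
  from ClusterShell.Task import task_self

  OSS_NODES = "demo[200-201]"   # assumed OSS node set, from the model file above

  task = task_self()
  # Run lctl on every OSS in parallel; one "param=value" line per OST is returned.
  task.run("lctl get_param obdfilter.*.kbytesavail", nodes=OSS_NODES)

  for buf, nodes in task.iter_buffers():
      # Lines look like: obdfilter.demofs-OST0003.kbytesavail=123456789
      for line in str(buf).splitlines():
          param, value = line.split("=", 1)
          ost = param.split(".")[1]
          print("%s %s kB available (reported by %s)" % (ost, value, NodeSet.fromlist(nodes)))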

Lustre 2.0 Monitoring (2/3)
Lustre RPC monitoring on the MDS
  Track nasty behavior

Lustre 2.0 Monitoring (3/3)
Lustre RPC tracing
  [Diagram on the original slide: RPC types are sampled per Lustre client NID/node by lustre_rpc_sample.py and mapped back to users/applications and jobs through Slurm (sacct)]

# ./lustre_rpc_sample.py analyze -f \
    lascaux113_11-09-16_11:40:07.log.gz -D
Reading sample file lascaux113_11-09-16_11:40:07.log.gz
GENERAL
  Duration: 5.0 sec
  Total RPC: 53776 (10775.1 RPC/sec)
  Ignored ping RPC: 582
NODENAME     COUNT  %     CUMUL  DETAILS
lascaux3283  1536   2.86   2.86  mds_getattr: 1536
lascaux3191  1536   2.86   5.71  mds_getattr: 1536
lascaux3181  1536   2.86   8.57  mds_getattr: 1536
lascaux3188  1536   2.86  11.43  mds_getattr: 1536
lascaux3189  1536   2.86  14.28  mds_getattr: 1536
lascaux3185  1536   2.86  17.14  mds_getattr: 1536
lascaux3187  1536   2.86  19.99  mds_getattr: 1536
lascaux3146  1536   2.86  22.85  mds_getattr: 1536
lascaux3190  1536   2.86  25.70  mds_getattr: 1532
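To close the loop from a node name back to jobs and users, the Slurm accounting step hinted at by the slide could look like the query below; the time window matches the sample file name, and the output fields are an assumption rather than something taken from the deck:

  # Which jobs were running on lascaux3283 around the sample time?
  # sacct --nodelist=lascaux3283 \
  #       --starttime=2011-09-16T11:35:00 --endtime=2011-09-16T11:45:00 \
  #       --format=JobID,User,JobName,NodeList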

Lustre 2 at CEA: enabled features
Quotas
  Quota issues with RHEL6 (LU-91, LU-369, LU-484)
  Inode and volume quotas enabled on TGCC (Curie)
Changelogs
  Patches applied (bz23035, bz23298, bz23120, LU-81, LU-542)
  Enabled on TGCC (Curie) for faster Robinhood scans
  mdd.*.changelog_mask = "MARK CREAT MKDIR HLINK SLINK UNLNK RMDIR RNMFM RNMTO TRUNC SATTR MTIME"
Large OST support (>16 TB)
  Added by LU-136 (available in Lustre 2.1)
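For reference, the standard Lustre commands to wire a changelog consumer such as Robinhood to an MDT look roughly as follows; this is a sketch with an assumed MDT name (demofs-MDT0000), not a copy of the CEA setup:

  # On the MDS: restrict the recorded record types, then register a reader.
  # lctl set_param mdd.demofs-MDT0000.changelog_mask="MARK CREAT MKDIR HLINK SLINK UNLNK RMDIR RNMFM RNMTO TRUNC SATTR MTIME"
  # lctl --device demofs-MDT0000 changelog_register    (prints a reader id, e.g. cl1)

  # On a client: read pending records, then acknowledge them (an end record of 0
  # means all records seen so far) so the MDT can purge them.
  # lfs changelog demofs-MDT0000
  # lfs changelog_clear demofs-MDT0000 cl1 0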

Lustre 2 at CEA: major issues resolved since 2.0 GA
  Client hang issues (resulting in lots of evictions): LU-416, LU-437 (LU-394)
  Client crashes on special I/O transfer nodes: LU-185
  I/O errors when using DM-Multipath (RHEL6): LU-275
  OSS pseudo-hang after OST stop/start: LU-328
  Frequent OSS crashes due to LBUG ASSERTION(last_rcvd >= le64_to_cpu(lcd->lcd_last_transno)): bz24420
  Binary execve() over Lustre FS issues: LU-300, LU-613

Conclusion
Lustre 2 stability
  Patched Lustre 2.0 is now very stable at CEA
  Lustre 2.1 should be as well, since all major patches have been integrated
  Thanks to the Whamcloud and Bull Lustre teams, both very responsive
Admin tools
  Good, scalable Lustre admin tools save a lot of time
  We will continue to improve our Open Source tools
Lustre activity profiling and tracing
  Useful things are already possible today
  Improvements are needed for the future

Questions?

Shine: http://lustre-shine.sf.net/
ClusterShell: http://clustershell.sf.net/
Robinhood Policy Engine: http://robinhood.sf.net/
NFS-Ganesha FSAL/LUSTRE: http://nfs-ganesha.sf.net/