Lustre at the OLCF: Experiences and Path Forward. Galen M. Shipman, Group Leader, Technology Integration


A Demanding Computational Environment
- Jaguar XT5: 18,688 nodes, 224,256 cores, 300+ TB memory, 2.3 PFlops
- Jaguar XT4: 7,832 nodes, 31,328 cores, 63 TB memory, 263 TFlops
- Frost (SGI Ice): 128-node institutional cluster
- Smoky: 80-node software development cluster
- Lens: 30-node visualization and analysis cluster

Spider
- Fastest Lustre file system in the world: demonstrated bandwidth of 240 GB/s on the center-wide file system
- Largest-scale Lustre file system in the world: demonstrated stability and concurrent mounts on all major OLCF systems (Jaguar XT5, Jaguar XT4, Opteron dev cluster (Spider), visualization cluster (Lens))
- Over 26,000 clients mounting the file system in production
- Over 282,000,000 files; multiple petabytes of data stored
- Cutting-edge resiliency at scale: demonstrated resiliency features on Jaguar XT5 (DM Multipath, Lustre router failover)

Spider: Helping Users and Staff
- Accessible from all major OLCF resources
- Decouples simulation-platform procurement from storage-system procurement
  - Enables data sharing between systems without resource-consuming data copies
  - Accessible during maintenance windows
  - Allows the file system to take an independent trajectory; procurements can be planned to better coincide with vendor roadmaps
[Diagram: Powerwall, remote visualization cluster, end-to-end cluster, application development cluster, and login nodes join Jaguar XT4 and Jaguar XT5 in mounting Spider over the SION network, with connectivity to ESnet]

Spider - Speeds and Feeds
- Link rates: Serial ATA 3 Gbit/s; InfiniBand 16 Gbit/s; XT5 SeaStar2+ 3D torus 9.6 GB/s per link
- Diagram bandwidth labels: 366 GB/s and 384 GB/s through the storage and network stages to Jaguar XT5, 96 GB/s to Jaguar XT4, plus connections to other systems (visualization, clusters)
- Enterprise storage: controllers and large racks of disks connected via InfiniBand; 48 DataDirect S2A9900 controller pairs with 1 TB drives and 4 InfiniBand connections per pair
- Storage nodes: run the parallel file system software and manage incoming file system traffic; 192 dual quad-core Xeon servers with 16 GB of RAM each
- SION network: provides connectivity between OLCF resources and primarily carries storage traffic; 3000+ port 16 Gbit/s InfiniBand switch complex
- Lustre router nodes: run LNET router software and forward requests from Lustre clients on the compute nodes; 192 (XT5) and 48 (XT4) nodes, each a single dual-core Opteron with 8 GB of RAM (a per-router arithmetic check follows this list)
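
Taken with the router counts above, the diagram's aggregate figures work out to roughly 2 GB/s per LNET router on both machines, assuming the 384 GB/s and 96 GB/s labels correspond to the XT5 and XT4 router stages (an inference from label adjacency, not stated explicitly on the slide):

    384 GB/s / 192 XT5 routers ≈ 2 GB/s per router
    96 GB/s / 48 XT4 routers ≈ 2 GB/s per router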

Snapshot of Technical Challenges Solved
- Performance innovations: improved SATA performance (2x improvement); network congestion avoidance (2x improvement)
- Scalability: 26,000 file system clients and counting
- SeaStar torus congestion: hard bounce of 7,844 nodes via 48 routers
- Fault-tolerance design features: system area network, I/O servers, storage arrays; InfiniBand support on XT SIO
[Chart: percent of observed peak combined R/W MB/s and IOPS vs. elapsed time (0-900 s) during the hard bounce; the XT4 is bounced at 206 s, followed by RDMA timeouts, bulk timeouts, and OST evictions; I/O returns at 435 s and full I/O resumes at 524 s]

Operational Experiences
- Mixed workloads can achieve ~75% of peak read performance and ~60% of peak write performance (worked numbers follow this list)
  - Write-back caching is disabled, which limits aggregate performance
  - Network (InfiniBand and SeaStar) congestion limits aggregate performance
- Single scratch and project directories across all platforms have been well received
  - Project areas are heavily used
  - Very few users are using Jaguar XT4 direct-attached storage
- System resiliency has been a win
  - Engineering for failure pays untold dividends
  - Over 25 saves by DM-multipath alone
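
Against the 240 GB/s demonstrated on the center-wide file system (treating that single figure as the relevant peak for both directions, an assumption since the slides do not break peak out by read and write), those fractions correspond to roughly:

    0.75 × 240 GB/s ≈ 180 GB/s of mixed-workload read bandwidth
    0.60 × 240 GB/s ≈ 144 GB/s of mixed-workload write bandwidth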

Operational Challenges
- Management of a system of this scale can be daunting
  - ¼ billion+ files make this challenging
  - Many custom tools developed, more on their way
- Scalable management tools are lacking
  - Diagnosis and recovery utilities to respond to failures
  - System log analysis: log parsing from over 200 servers; log correlation to compute node and application (see the sketch after this list)
  - InfiniBand fabric monitoring
- Single MDS performance: a single application collectively creating 220K files
- Unscalable I/O can impact the entire user population
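
The log-correlation item above amounts to a join between server-side events and scheduler records. Below is a minimal sketch assuming a hypothetical syslog-style line format and a simple (nid, start, end, job_id) job table; the OLCF's production tooling is custom and is not shown here.

```python
import re
from datetime import datetime

# Hypothetical log format for illustration; real OSS/MDS logs and scheduler
# records differ in detail.
LOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ Lustre: "
    r"(?P<msg>.*?client (?P<nid>\S+@\S+).*)$"
)

def parse_server_log(lines):
    """Yield (timestamp, client NID, message) from one server's log stream."""
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            yield ts, m.group("nid"), m.group("msg")

def correlate(events, jobs):
    """Attribute each server-side event to the job running on that client.

    `jobs` is a list of (nid, start, end, job_id) tuples derived from
    scheduler records; `events` comes from parse_server_log().
    """
    for ts, nid, msg in events:
        for jnid, start, end, job_id in jobs:
            if nid == jnid and start <= ts <= end:
                yield job_id, nid, ts, msg
                break
```

In practice the join would be indexed by NID and time range rather than scanned linearly, but the structure of the correlation is the same.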

Designed to Support Peak Performance
[Chart: hourly maximum read and write bandwidth (GB/s, 0-80) on ½ of the storage controllers over March 2010]

Many Workloads Are IOPS Intensive
[Chart: hourly maximum read and write operations per second (thousands, 0-400) on ½ of the storage controllers over March 2010]

Peak Bandwidths - Past 5 Months
[Chart: monthly maximum read and write bandwidth (GB/s, 0-100) on ½ of the storage controllers, November 2009 through March 2010]

Managing Our Environment: System Analytics
- System metrics: centralized collection of system performance, faults, and other metrics; historical analysis; snapshots; web interface; future integration with DSM
- Event log analysis (under development): determine fault patterns; cluster-based approach; correlation of events; log visualization

LustreDU (Beta)
- Provides functionality similar to the Unix du command
  - A regular du could take hours (or days) to run because of the size of the file system
- Makes use of the output of ne2scan
  - ne2scan is already run regularly as part of the file purge cycle
  - LustreDU piggybacks on the ne2scan run and thus doesn't require a separate scan of the entire file system
- One main executable plus daemons running on each OSS
  - Output from ne2scan (which runs on the MDS) contains all needed metadata except for object size
  - Must query an OST for each object's size (a sketch of the aggregation step follows this list)
- Two kinds of output: text file or MySQL database
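
Below is a minimal sketch of the aggregation step, assuming a simplified CSV stand-in for the ne2scan output and a pre-fetched map of object sizes (which LustreDU instead obtains from its per-OSS daemons); the real executable and daemons are considerably more involved.

```python
import csv
from collections import defaultdict

def directory_usage(listing_path, object_sizes):
    """Aggregate per-directory usage, du-style, from a simplified listing.

    `listing_path` points at a hypothetical CSV with one row per file object:
        path,ost_index,object_id
    `object_sizes` maps (ost_index, object_id) -> size in bytes; in LustreDU
    these sizes come from daemons on each OSS, since the ne2scan output from
    the MDS has everything except object size.
    """
    usage = defaultdict(int)
    with open(listing_path, newline="") as f:
        for path, ost_index, object_id in csv.reader(f):
            size = object_sizes.get((int(ost_index), object_id), 0)
            parts = path.strip("/").split("/")
            # Charge the object's size to every ancestor directory, as du does.
            for depth in range(1, len(parts)):
                usage["/" + "/".join(parts[:depth])] += size
    return usage
```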

Parallel Data Utilities
- SPDCP (released!): a parallel, Lustre-aware copy tool
  - Uses multiple compute nodes for data movement
  - Provides near-cp semantics and preserves Lustre striping (a striping sketch follows this list)
  - Demonstrated speedups of 93x over cp
- PLTAR (beta): a parallel, Lustre-aware tar tool
  - Uses multiple compute nodes for data movement
  - Preserves Lustre striping and produces POSIX-compliant tar archives
  - Demonstrated speedups of 49x over tar
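
The striping-preservation step can be illustrated serially with the `lfs` utility. The sketch below is not SPDCP itself: it omits the multi-node parallel data movement, and the `lfs` option spellings vary across Lustre releases.

```python
import shutil
import subprocess

def copy_preserving_stripe(src, dst):
    """Serial illustration of a striping-preserving copy on Lustre.

    Read the source layout with `lfs getstripe`, pre-create the destination
    with the same layout via `lfs setstripe`, then copy the bytes. SPDCP
    additionally spreads the data movement across many compute nodes, which
    this sketch does not attempt.
    """
    count = subprocess.check_output(["lfs", "getstripe", "-c", src]).decode().strip()
    # The stripe-size flag is -S on recent Lustre releases, -s on some older ones.
    size = subprocess.check_output(["lfs", "getstripe", "-S", src]).decode().strip()
    subprocess.check_call(["lfs", "setstripe", "-c", count, "-S", size, dst])
    with open(src, "rb") as fin, open(dst, "r+b") as fout:
        shutil.copyfileobj(fin, fout)
```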

Path Forward for Spider
- Complete at-scale testing of fine-grained routing; target deployment during our 1.8.x deployment
- Large EA support is still outstanding
  - Shared-file limit of 160 OSTs; how do we deploy this with 1.8.x?
- Continue to explore MDS performance enhancements (reduce locking overhead)
- Consider multiple center-wide file systems as a risk mitigation strategy
  - May target 3 shared file systems in the final configuration rather than 2
  - Provides flexibility to isolate by project or workload
- Build and improve existing diagnostic and administrative tools
  - Improved tools to locate candidate files following a failure
  - Improved performance monitoring and diagnostic tools
  - Mechanisms to correlate system failures/interrupts with aberrant performance

Challenges for 2011/2012
- 2011 system: ~20 petaflops peak, >1 petabyte of system memory, ~20K Lustre clients
- 3.4x bandwidth increase and up to 7x capacity increase over today's Spider (checked arithmetically below)
[Diagram: the 2011 system reaches the storage servers over 384 routers at >800 GB/s, while Jaguar XT5 uses 192 routers at 240 GB/s; visualization, data analytics, other legacy systems, HPSS archival, GridFTP servers, and Lustre WAN gateways attach via the DDR InfiniBand and SION QDR InfiniBand networks, with wide-area connectivity to ESnet, USN, TeraGrid, Internet2, and NLR; additional link labels of 20 GB/s and 80 GB/s appear in the figure; current storage is 48 DDN 9900s with 13K disks and 10 PB]
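
The headline multipliers are consistent with the diagram's figures; the baselines (240 GB/s and 10 PB today) come from this and earlier slides:

    3.4 × 240 GB/s ≈ 816 GB/s, matching the > 800 GB/s target
    7 × 10 PB ≈ 70 PB of capacity at the upper end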

The Memory-Storage Gap
- Problem: memory constraints may place more burdens on the file system (productive I/O)
- Storage system latency is unlikely to improve (HDD)
- Sequential bandwidth will continue to improve, but at a fraction of the rate of densities
[Charts: disk drive throughput and capacity/density trends over time; courtesy Brad Settlemyer & Sudharshan Vazhkudai (ORNL)]

What About Flash?
- Lifetime constraints
- Performance variability
- Integration: standalone flash file system, or hierarchical with bleed-down strategies
- Packaging: SSD, RAID, PCIe
- Reliability
- Proper evaluation of flash requires detailed workload characteristics: duty cycles, aggregate I/O characterization, individual application I/O characterization (a small characterization sketch follows)
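
As a small illustration of what duty-cycle and read/write characterization means here, the sketch below summarizes a hypothetical per-application I/O trace of (timestamp, op, bytes) records; real characterization at this scale would draw on server-side metrics rather than a toy trace.

```python
def characterize(trace, window_seconds):
    """Summarize a hypothetical I/O trace: a list of (timestamp_s, op, nbytes)
    tuples where op is "read" or "write".

    Returns (write_fraction_by_bytes, duty_cycle); duty cycle here is the
    fraction of one-second buckets in the window that saw any I/O at all.
    """
    read_bytes = sum(n for _, op, n in trace if op == "read")
    write_bytes = sum(n for _, op, n in trace if op == "write")
    total = read_bytes + write_bytes
    busy_seconds = {int(ts) for ts, _, _ in trace}
    write_fraction = write_bytes / total if total else 0.0
    duty_cycle = len(busy_seconds) / float(window_seconds) if window_seconds else 0.0
    return write_fraction, duty_cycle

# Example: a 60-second window that is 80% writes by volume and mostly idle.
example = [(0.1, "write", 4 << 20), (0.4, "write", 4 << 20), (30.2, "read", 2 << 20)]
print(characterize(example, 60))
```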

Flash: Not a Silver Bullet
- Extreme performance variability, and strongly workload dependent
[Charts: measured SSD throughput over time for a 1 MB sequential workload (roughly 120-240 MB/s) and for a sequence of mixed sequential/random read/write phases at 80/20, 60/40, 40/60, and 20/80 write/read ratios, with throughput swinging from under 100 MB/s to around 1,700-1,800 MB/s]

Storage Media
- Expect marginal improvements in disk drive performance (bandwidth); latency has flat-lined
- A mixed-media environment may be required to meet capacity and bandwidth requirements: separate file systems, but globally accessible
- Integration of flash needs further evaluation

Estimated number of drives required to meet performance of ~800 GB/s (implied per-drive rates are worked out below):

Media type        Total drives
Near-line SAS     20,000
Enterprise SAS    10,000
MLC Flash          4,000
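
Dividing the ~800 GB/s target by the drive counts gives the per-drive sustained bandwidths these estimates imply (an inference from the table, not figures stated on the slide):

    800,000 MB/s / 20,000 drives ≈ 40 MB/s per near-line SAS drive
    800,000 MB/s / 10,000 drives ≈ 80 MB/s per enterprise SAS drive
    800,000 MB/s / 4,000 devices ≈ 200 MB/s per MLC flash device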

OLCF-3 File System
- Lustre is our plan of record for the 2011/2012 system
  - Proven scalability and performance on Jaguar XT5
  - Leverages our skills and experience in Lustre
- An evolution of our current parallel file system infrastructure: lower risk, high confidence
  - Component counts (disks, I/O nodes, etc.) do not increase significantly over our current system (order 2x)
- New features in Lustre will be needed to improve scalability and reliability: improved metadata performance, improved recovery performance, improved manageability

Why Lustre?
- Meets our requirements today: performance, scalability, reliability; no other file system meets our current requirements
- Planned features meet our projected requirements
- Deep skills in the Lustre parallel file system: development, deployment, and operations
- Active contributor in the Lustre community
- The only open source parallel file system
  - Provides risk mitigation strategies that closed source does not
  - Broadening community of developers

Lustre Features Needed for OLCF-3
- Improved metadata server (MDS) performance
  - SMP scaling improvements (single MDS server)
  - Clustered metadata servers (schedule risk)
  - Ordered operations (epochs)
- Improved data integrity protection/detection
  - Integrated checksums
  - Network packet protection (LNET); end-to-end protection desirable
- Support for larger LUNs
  - Currently limited to 8 TB; Oracle/ORNL work has increased this to 16 TB (Lustre 1.8.2)
  - May need 32 TB support or beyond; ZFS is the preferred path, with ext4 enhancements as a fallback
- Predictable performance
  - Network request scheduling
  - Quality-of-service mechanisms

Lustre and ZFS
- Lustre currently uses an enhanced ext4 file system for object and metadata storage
- Lustre is moving to add support for the ZFS file system for object and metadata storage
- ZFS features
  - Capacity: large volume support (512 TB), trillions of files
  - Reliability and integrity: copy-on-write, integrated checksums, online integrity checking and reconstruction, snapshots
  - Performance concerns remain
- LLNL is porting the ZFS kDMU to Linux; it appears to be unsupported at the moment
- Licensing: Linux is released under the GPL, ZFS under the CDDL; redistribution of Linux + ZFS may have licensing issues

Implications of License Incompatibilities
- Vendors such as DDN and LSI may not be able to sell integrated solutions based on Lustre + ZFS + Linux
- End users (ORNL, LLNL, etc.) can integrate and be in compliance, but what about support?
- Lustre + enhanced ext4 + Linux will remain an option for vendors
- Other backend file systems could also be supported (btrfs)
- Is Lustre + OpenSolaris an option for vendors?

Mitigation Strategies for OLCF-3
- Actively engaging Oracle on Lustre
- Continuing our partnerships with key Lustre users
  - ORNL-led weekly Lustre community meetings are ongoing, with Oracle file system engineers attending
- Evaluation of a number of Lustre configurations
  - Lustre 1.8 with enhancements
  - Lustre 2.0 (Linux, OpenSolaris)
  - ldiskfs and ZFS testing
- Evaluating alternative file systems as risk mitigation: GPFS, Panasas

Preparing for OLCF-3: Testbeds
- Deployment of a next-generation file system testbed in 2010, representative of the OLCF-3 storage system
  - Multiple integrated storage controller technologies
  - Mixed SAS/SATA environment
  - InfiniBand
  - Exploring flash integration
- Capable of testing a variety of file systems and technologies
- Additional build-out of testbeds in 2011

Final Thought: Listening to Your Users
- Need for an open forum to discuss user experiences, user requirements, and next-generation features
  - Allows aggregation and concentration of requirements
- The Lustre User Group could do this
  - Allow users to run LUG! Run by a LUG board of directors drawn from major Lustre sites
  - The Cray User Group and IBM SPXXL are excellent examples

Questions?
Contact info: Galen Shipman, Group Leader, Technology Integration, 865-576-2672, gshipman@ornl.gov