Efficient Object Storage Journaling in a Distributed Parallel File System


Efficient Object Storage Journaling in a Distributed Parallel File System
Presented by Sarp Oral
Sarp Oral, Feiyi Wang, David Dillow, Galen Shipman, Ross Miller, and Oleg Drokin
FAST '10, Feb 25, 2010

A Demanding Computational Environment
- Jaguar XT5: 18,688 nodes, 224,256 cores, 300+ TB memory, 2.3 PFlops
- Jaguar XT4: 7,832 nodes, 31,328 cores, 63 TB memory, 263 TFlops
- Frost (SGI Ice): 128-node institutional cluster
- Smoky: 80-node software development cluster
- Lens: 30-node visualization and analysis cluster

Spider
- Fastest Lustre file system in the world
  - Demonstrated bandwidth of 240 GB/s on the center-wide file system
- Largest-scale Lustre file system in the world
  - Demonstrated stability and concurrent mounts on major OLCF systems: Jaguar XT5, Jaguar XT4, Opteron dev cluster (Smoky), visualization cluster (Lens)
  - Over 26,000 clients mounting the file system and performing I/O
  - General availability on Jaguar XT5, Lens, Smoky, and GridFTP servers
- Cutting-edge resiliency at scale
  - Demonstrated resiliency features on Jaguar XT5: DM Multipath and Lustre router failover

Designed to Support Peak Performance
[Chart: hourly read and write bandwidth (GB/s) over a January 2010 timeline]
- Max data rates (hourly) on ½ of the available storage controllers

Motivations for a Center-Wide File System
- Building dedicated file systems for platforms does not scale
  - Storage is often 10% or more of new system cost
  - Storage is often not poised to grow independently of the attached machine
  - Different curves for storage and compute technology
- Data needs to be moved between different compute islands
  - Simulation platform to visualization platform
- Dedicated storage is only accessible when its machine is available
- Managing multiple file systems requires more manpower
[Diagram: Lens, Jaguar XT4, Smoky, Jaguar XT5, and Ewok connected through the SION network and the Spider system; data sharing path]

Spider: A System at Scale
- Over 10.7 PB of RAID 6 formatted capacity
  - 13,440 1 TB drives
- 192 Lustre I/O servers
  - Over 3 TB of memory (on the Lustre I/O servers)
- Available to many compute systems through the high-speed SION network
  - Over 3,000 IB ports
  - Over 3 miles (5 kilometers) of cables
- Over 26,000 client mounts for I/O
- Peak I/O performance is 240 GB/s
- Current status: in production use on all major OLCF computing platforms

Lustre File System
- Developed and maintained by CFS, then Sun, now Oracle
- POSIX-compliant, open-source parallel file system, driven by DOE labs
- Metadata server (MDS) manages the namespace
- Object storage server (OSS) manages object storage targets (OSTs)
- OSTs manage block devices
  - ldiskfs on OSTs: v1.6 is a superset of ext3; v1.8+ is a superset of ext3 or ext4
- Parallelism by object striping (see the layout sketch below)
- Highly scalable, tuned for parallel block I/O
[Diagram: Lustre clients connected over a high-performance interconnect to the metadata server (MDS) with its metadata target (MDT) and to object storage servers (OSS) with their object storage targets (OSTs)]
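To make the striping idea concrete, here is a minimal sketch of how a round-robin, RAID-0-style layout maps a file offset to an OST object. The function name and default values are illustrative assumptions, not Lustre's actual code or settings.

```python
def stripe_location(offset, stripe_count=4, stripe_size=1 << 20):
    """Map a byte offset in a striped file to (OST index, offset in object).

    Illustrative round-robin layout: consecutive stripe_size chunks go to
    consecutive OSTs, wrapping around after stripe_count stripes.
    """
    stripe_number = offset // stripe_size              # which chunk of the file
    ost_index = stripe_number % stripe_count           # OST serving that chunk
    obj_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return ost_index, obj_offset

# Example: with 4 OSTs and 1 MiB stripes, byte offset 5 MiB lands on OST 1.
print(stripe_location(5 * (1 << 20)))                  # (1, 1048576)
```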

Spider - Overview
- Currently providing high-performance scratch space to all major OLCF platforms
- 96 DDN S2A9900 controllers (singlets)
- 192 Lustre I/O servers
- 13,440 SATA-II disks
- 1,344 (8+2) RAID level 6 arrays (tiers)
[Diagram: the SION InfiniBand network connecting the Jaguar XT5 and XT4 segments, Smoky, and Lens/Everest through core and aggregation switches and 48 leaf switches to the 192 Lustre I/O servers over 4x DDR IB links, with SAS connections from the I/O servers to the DDN controllers]

Spider - Speeds and Feeds
- Link rates: Serial ATA 3 Gbit/s, InfiniBand 16 Gbit/s, XT5 SeaStar2+ 3D torus 9.6 GB/s
- [Diagram: aggregate data paths of 366 GB/s and 384 GB/s from Jaguar XT5 and 96 GB/s from Jaguar XT4 and other systems (viz, clusters) into the storage fabric]
- Enterprise storage: controllers and large racks of disks are connected via InfiniBand
  - 48 DataDirect S2A9900 controller pairs with 1 TB drives and 4 InfiniBand connections per pair
- Storage nodes run the parallel file system software and manage incoming FS traffic
  - 192 dual quad-core Xeon servers with 16 GB of RAM each
- SION network provides connectivity between OLCF resources and primarily carries storage traffic
  - 3,000+ port 16 Gbit/s InfiniBand switch complex
- Lustre router nodes run the parallel file system client software and forward I/O operations from HPC clients
  - 192 (XT5) and 48 (XT4) dual-core Opteron nodes with 8 GB of RAM each

Spider - Couplet and Scalable Cluster
- Each unit comprises:
  - 280 1 TB disks in 5 disk trays
  - A DDN S2A9900 couplet (2 controllers)
  - Lustre I/O servers: OSS (4 Dell nodes)
  - A 24-port Flextronics IB switch with an uplink to the Cisco core switch
- A Spider Scalable Cluster (SC) contains 3 such units
- 16 SC units on the floor, 2 racks for each SC

Spider - DDN S2A9900 Couplet
[Diagram: a couplet of two controllers, each with house and UPS power supplies and five I/O modules serving channels A1-H1, P1, S1 and A2-H2, P2, S2; five disk enclosures, each with house and UPS power supplies and disk expansion modules (DEMs) holding drives D1-D56]

Spider - DDN S2A9900 (cont'd)
- RAID 6 (8+2): each tier has 8 data drives and 2 parity drives (capacity arithmetic below)
[Diagram: 28 tiers striped across data channels A-H plus parity channels P and S, split between disk controller 1 (tiers 1-14) and disk controller 2 (tiers 15-28)]
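As a quick sanity check on this layout, the formatted capacity quoted earlier follows directly from the tier geometry; the short calculation below only restates numbers already on the slides.

```python
# Formatted capacity implied by 13,440 x 1 TB drives in RAID 6 (8+2) tiers.
total_drives = 13_440        # 1 TB SATA-II drives
tier_width = 10              # 8 data + 2 parity drives per tier
data_drives_per_tier = 8     # parity drives contribute no usable capacity

tiers = total_drives // tier_width            # 1,344 tiers
formatted_tb = tiers * data_drives_per_tier   # 1 TB per data drive
print(tiers, "tiers,", formatted_tb / 1000, "PB")   # 1344 tiers, 10.752 PB
```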

Spider - How Did We Get Here?
- A 4-year project
- We didn't just pick up the phone and order a center-wide file system
  - No single vendor could deliver this system
  - Trail blazing was required
- Collaborative effort was key to success: ORNL, Cray, DDN, Cisco, and CFS, Sun, and now Oracle

Spider Solved Technical Challenges
- Performance
  - Asynchronous journaling
  - Network congestion avoidance
- Scalability
  - 26,000 file system clients
- Fault tolerance design
  - Network, I/O servers, storage arrays
- InfiniBand support on XT SIO
[Chart: SeaStar torus congestion during a hard bounce of 7,844 nodes via 48 routers; percent of observed peak MB/s and IOPS vs. elapsed time, showing the XT4 bounce at 206 s, RDMA and bulk timeouts, OST evictions, I/O returning at 435 s, and full I/O at 524 s]

ldiskfs Journaling Overhead
- Even sequential writes exhibit random I/O behavior due to journaling
  - Observed 4-8 KB writes along with the 1 MB sequential writes on the DDNs
  - DDN S2A9900s are not well tuned for small I/O accesses
  - For enhanced reliability, the write-back cache on the DDNs is turned off
- A special file (contiguous block space) is reserved for journaling on ldiskfs
  - Labeled as the journal device
  - Located at the beginning of the physical disk layout
- Ordered mode: after the file data portion is committed to disk, the journal metadata portion needs to be committed
- An extra head seek is needed for every journal transaction commit! (illustrated below)
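A rough, illustrative model of why that extra seek hurts so much: if every bulk write also pays a round trip to the journal area at the start of the disk, most of the drive's streaming bandwidth disappears. All numbers here are assumptions for illustration (and real transactions batch many writes), not measurements from the talk.

```python
# Toy model: effective bandwidth when each 1 MB write is accompanied by a
# small journal commit write at the start of the disk plus two head seeks.
seq_bw = 100e6            # assumed streaming bandwidth of one SATA drive, bytes/s
seek_time = 8e-3          # assumed average seek time, seconds
bulk_write = 1 << 20      # 1 MB data write
commit_write = 4 << 10    # 4 KB journal commit record

t_data = bulk_write / seq_bw
t_commit = 2 * seek_time + commit_write / seq_bw   # seek to journal, write, seek back
effective = bulk_write / (t_data + t_commit)
print(f"effective bandwidth: {effective / 1e6:.1f} MB/s "
      f"({effective / seq_bw:.0%} of streaming)")   # ~39.5 MB/s, ~40%
```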

ldiskfs Journaling Overhead (cont'd)
- Block-level benchmarking (writes) for 28 tiers: 5,608.15 MB/s (baseline)
- File-system-level benchmark (obdfilter) gives 1,398.99 MB/s
  - 24.9% of baseline bandwidth
  - One couplet, 4 OSS each with 7 OSTs
  - 28 clients, one-to-one mapping with OSTs
- Analysis
  - Large number of 4 KB writes in addition to the 1 MB writes
  - Traced back to ldiskfs journal updates
[Table 1: XDD baseline performance]
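A rough consistency check of my own (not from the talk), relating the single-couplet block-level baseline above to the center-wide peak quoted earlier:

```python
# Scale the single-couplet XDD baseline to all 48 couplets and compare with
# the quoted 240 GB/s file system peak (my arithmetic, not from the slides).
couplets = 48
per_couplet_block_mb_s = 5608.15        # block-level baseline, one couplet
aggregate_block_gb_s = couplets * per_couplet_block_mb_s / 1000
print(f"{aggregate_block_gb_s:.0f} GB/s raw block level vs. 240 GB/s file system peak")
# -> ~269 GB/s, so the 240 GB/s center-wide peak is plausible once file
#    system and network overheads are taken into account.
```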

Minimizing Extra Disk Head Seeks
- Hardware solutions
  - External journal on an internal SAS tier
  - External journal on a network-attached solid-state device
- Software solution
  - Asynchronous journal commits

Configuration                            Bandwidth MB/s (single couplet)   % of baseline
Block level (28 tiers)                   5,608.15                          100% (baseline)
Internal journals, SATA                  1,398.99                          24.9%
External journals, internal SAS tier     1,978.82                          35.2%
External journals, RamSan SSD (sync)     3,292.60                          58.7%
Internal journals, SATA (async)          5,222.95                          93.1%

External Journals on a Solid-State Device
- Texas Memory Systems RamSan-400 (loaned by Vion Corp.)
  - Non-volatile SSD
  - 3 GB/s block I/O, 400,000 IOPS
  - 4 IB DDR ports with SRP
- 28 LUNs, one-to-one mapping with the DDN LUNs
- Obtained 58.7% of baseline performance
  - Network round-trip latency or inefficiency in the external journal code path might be the culprit
[Diagram: the RamSan-400 attached via 4 DDR IB links to the SION core, aggregation, and leaf switch fabric connecting the Jaguar XT5 and XT4 segments, Smoky, Lens/Everest, and Spider]

Synchronous Journal Commits
- Running and closed transactions
  - A running transaction accepts new threads joining in and has all its data in memory
  - A closed transaction starts flushing updated metadata to the journaling device; after the flush is complete, the transaction state is marked as committed
  - The current running transaction can't be closed and committed until the closed transaction fully commits to the journaling device
- Congestion points
  - Slow disk
  - Journal size (1/4 of the journal device)
  - Extra disk head seek for the journal transaction commit
  - Write I/O for new threads is blocked on the currently closed transaction that is committing (sketched below)
[Diagram: transaction life cycle RUNNING -> CLOSED -> COMMITTED; the running transaction is marked CLOSED in memory by the Journaling Block Device (JBD) layer, file data is flushed from memory to disk prior to committing the transaction, then the updated metadata blocks are written to the journaling device and flushed to disk]
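The blocking behavior described on this slide can be summarized in a deliberately simplified sketch (this is not JBD code; the class and method names are invented): incoming writers stall whenever the previously closed transaction is still flushing to the journal device.

```python
import threading

class ToyJournal:
    """Simplified model of the synchronous-commit bottleneck described above."""

    def __init__(self):
        self._cond = threading.Condition()
        self._closed_committing = False   # a CLOSED transaction is still flushing

    def join_running(self):
        """A writer joins the RUNNING transaction; if the previous CLOSED
        transaction is still committing, the writer blocks here."""
        with self._cond:
            while self._closed_committing:
                self._cond.wait()         # stalled behind the slow journal flush
            # ... record this writer's metadata updates in memory ...

    def commit(self, flush_file_data, flush_metadata_to_journal):
        """Close the running transaction and commit it in order:
        file data first (ordered mode), then metadata plus the commit record."""
        with self._cond:
            self._closed_committing = True       # RUNNING -> CLOSED
        flush_file_data()                        # file data must reach disk first
        flush_metadata_to_journal()              # extra seek to the journal area
        with self._cond:
            self._closed_committing = False      # CLOSED -> COMMITTED
            self._cond.notify_all()              # unblock waiting writers
```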

Asynchronous Journal Commits
- Change how Lustre uses the journal, not the operation of the journal itself
- Every server RPC reply has a special field (default, sync)
  - The id of the last transaction on stable storage
  - The client uses this to keep a list of completed, but not yet committed, operations
  - In case of a server crash these can be resent (replayed) to the server
- Clients pin dirty and flushed pages in memory (default, sync)
  - Released only when the server acks that they are committed to stable storage
- Relax the commit sequence (async)
  - Add an async flag to the RPC
  - Reply to clients immediately after the file data portion of the RPC is committed to disk
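A sketch of the client-side bookkeeping this implies, with invented names: each reply carries the last transaction number known to be on stable storage, and the client keeps pinned pages for anything newer so it can replay them after a server crash.

```python
class UncommittedOps:
    """Toy model of the client's completed-but-not-committed list."""

    def __init__(self):
        self._pinned = {}   # transno -> pinned pages of a completed write

    def on_reply(self, transno, pages, last_committed):
        """Record a completed write and drop anything the server has committed."""
        self._pinned[transno] = pages
        for t in [t for t in self._pinned if t <= last_committed]:
            del self._pinned[t]          # safe: on stable storage at the server

    def replay(self, resend):
        """After a server crash, resend operations the server may have lost."""
        for t in sorted(self._pinned):
            resend(t, self._pinned[t])
```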

Asynchronous Journal Commits (cont'd)
1. Server gets the destination object id and offset for the write operation
2. Server allocates the necessary number of pages in memory and fetches data from the remote client into those pages
3. Server opens a transaction on the back-end file system
4. Server updates file metadata, allocates blocks, and extends the file size
5. Server closes the transaction handle
6. Server writes the pages with file data to disk synchronously
7. If the async flag is set, the server completes the operation asynchronously:
   - Server sends a reply to the client
   - JBD flushes the updated metadata blocks to the journaling device and writes the commit record
8. If the async flag is NOT set, the server completes the operation synchronously:
   - JBD flushes the updated metadata blocks to the journaling device and writes the commit record
   - Server sends a reply to the client
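The eight steps above can be read as the following pseudocode; every function and attribute name here is illustrative rather than an actual Lustre symbol, and error handling is omitted.

```python
def handle_write_rpc(rpc, fs, journal):
    obj, offset = rpc.object_id, rpc.offset            # 1. destination object and offset
    pages = rpc.fetch_data_pages()                      # 2. pull file data from the client
    txn = fs.start_transaction()                        # 3. open a back-end transaction
    fs.update_metadata(txn, obj, offset, len(pages))    # 4. allocate blocks, extend size
    fs.close_transaction(txn)                           # 5. close the transaction handle
    fs.write_pages_sync(obj, offset, pages)             # 6. file data reaches disk first

    if rpc.async_flag:                                  # 7. async: reply, then commit
        rpc.send_reply(last_committed=fs.last_committed_transno())
        journal.commit(txn)                             #    metadata blocks + commit record
    else:                                               # 8. sync: commit, then reply
        journal.commit(txn)
        rpc.send_reply(last_committed=fs.last_committed_transno())
```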

Asynchronous Journal Commits (cont'd)
- Async journaling achieves 5,223 MB/s (at the file system level), or 93% of baseline
- Cost-effective
  - Requires only a Lustre code change
  - Easy to implement and maintain
- Temporarily increases client memory consumption
  - Clients have to keep more data in memory until the server acks the commit
- Does not change the failure semantics or reliability characteristics
  - The guarantees about file system consistency at the local OST remain unchanged
[Chart: hardware- and software-based journaling solutions; aggregate block I/O (MB/s) vs. number of threads for external journals on a tier of SAS disks, external journals on the RamSan-400 device, internal journals on SATA disks, and async internal journals on SATA disks]

Asynchronous Journal Commits - Application Performance
- Gyrokinetic Toroidal Code (GTC)
  - Reduced the share of small I/O requests from 64% to 26.5%
  - Up to 50% reduction in runtime
- Might not be typical; depends on the application
[Charts: GTC runtime with sync vs. async journals, and the I/O request size distribution for GTC runs with 1,344 writers: number of write I/O requests (%) by I/O request size (in KB), sync vs. async journals]

Conclusions
- For a system at this scale, we can't just pick up the phone and order one
- No problem is small when you scale it up
- At the Lustre file system level we obtained 24.9% of our baseline block-level performance
  - Tracked to ldiskfs journal updates
- Solutions
  - External journals on an internal SAS tier; achieved 35.2%
  - External journals on a network-attached SSD; achieved 58.7%
  - Asynchronous journal commits; achieved 93.1%
    - Removed a bottleneck from the critical write path
    - Decreased the 4 KB I/Os observed on the DDNs by 37.5%
    - Cost-effective, easy to deploy and maintain
    - Temporarily increases client memory consumption
    - Doesn't change failure characteristics or semantics

Questions?
Contact info:
Sarp Oral - oralhs at ornl dot gov
Technology Integration Group
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory
FAST '10, Feb 25, 2010