SFA12KX and Lustre Update
Maria Perez Gutierrez, HPC Specialist
HPC Advisory Council, Sep 2014
Agenda
- SFA12KX features update
- Partial rebuilds
- QoS on reads
- Lustre metadata performance update
SFA12KX Features Update
DDN FY14 Focus Areas: Big Data & Cloud Infrastructure
- Big data platform management: DirectMon, analytics, reference architectures (* = future release)
- Core storage platforms (Storage Fusion Architecture), flexible drive configuration (SAS, SATA, SSD):
  - SFA12KX/E: 40 GB/s, 1.7M IOPS, 1,680 drives supported, embedded computing
  - SFA7700 / 7700x / 7700E*: 13 GB/s, 600K IOPS
- Petascale Lustre storage (EXAScaler): 10Ks of clients, 1 TB/s+, HSM, Linux HPC clients, NFS
- Enterprise scale-out file storage (GRIDScaler): ~10K clients, 1 TB/s+, HSM, Linux/Windows HPC clients, NFS & CIFS
- Flash: Infinite Memory Engine (distributed file system buffer cache*, demo); SFX automated flash caching (Read Context Commit, Instant Commit); WolfCreek (FY14)
- Cloud foundation (WOS 3.0): S3/Swift cloud tiering, 32 trillion unique objects, geo-replicated cloud storage, 256 million objects/second, self-healing, parallel boolean search; WOS7000: 60 drives in 4U, self-contained servers
- FY14 highlights:
  - Platform: SFA hardening, higher speed, embedded WolfCreek, 7700E*/7700x*
  - IME: full speed, PFS acceleration, more use cases under review
  - WOS: S3/Swift access, cost reduction, performance improvements
DDN Confidential - NDA Required, Roadmap Subject to Change
SFA12K Family Addition: SFA12KX
High-scale active/active block storage appliance, available May 2014 (GA: Q2 2014)
Specifications:
- Appliances: active/active block controllers
- CPU: dual-socket Intel 8-core Ivy Bridge
- Memory & battery-backed cache: 128 GB DDR3-1866, 64 GB mirrored
- SFX flash cache: up to 12.8 TB of write-intensive SSDs
- RAID levels: RAID 1/5/6
- IB host connectivity: 16 x 56 Gb/s (FDR)
- FC host connectivity: 32 x 16 Gb/s
- Drive support: 2.5" and 3.5" SAS, SATA, SSD
- Max JBOD expansion: up to 20 supported JBODs (SS7000, SS8460)
Estimated bandwidth (SFA12KX with 20x SS8460 enclosures, InfiniBand SRP):
- Write*: 41 GB/s
- Read*: 48 GB/s
* Large block, full stripe; read I/O includes parity verification
Introducing SFA12KX (Q2 2014)
- Over 40 GB/s reads AND writes
- Over 20 GB/s MIRRORED writes
- Linear scalability
- SFX ready
- Latest Intel processor
- 32 ports FC-16
[Charts: random and sequential write throughput (MB/s) vs. pool count (1-56) for I/O sizes 64K-8M; sequential writes reach ~40,000 MB/s]
SFA Feature Baseline
- Priority, queuing, real-time: QoS for read operations, critical for striped-file performance consistency; ReACT I/O fairness (lowest latency for small I/O, highest bandwidth for large & streaming I/O); dialed-in host I/O priority during rebuilds
- Storage Fusion Fabric: highly over-provisioned, RAIDed backend fabric; withstands failures of drives, enclosures, cabling, etc.
- Performance, flash: real-time, multi-CPU RAID engine; interrupt-free, massively parallel I/O delivery system; RAID rebuild acceleration; SFX cache, an automated, data-driven caching system enabling hybrid SSD & HDD systems
- Reliability, availability & monitoring/management: RAID 10 (8+2); fast pool rebuilds; partial rebuilds; adjustable rebuild priority; SSD life counter
- High-density array: up to 840 HDDs in a single rack; 84 HDDs/SSDs in 4U
- Data integrity & security: DirectProtect RAID 1/5/6 with real-time data integrity verification for every I/O; SED management; data-at-rest encryption of all user data; instant crypto erase
SFA Partial Rebuild Example: Persistent Partial Rebuild
1. Complete enclosure removed for F/W upgrade. Controllers redirect I/O destined for the absent enclosure to available drives and hold the corresponding metadata in synchronous mirrored cache.
2. Controller 1 fails; I/O continues.
3. Complete system outage due to power failure; controller 2 writes its cache to onboard drives.
4. Power restored; controller cache restored.
5. Upgraded enclosure undergoes a partial rebuild of the cached data in minutes.
Result: 84 x 4 TB disks (336 TB) removed for hours, rebuilt in minutes.
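The core idea behind a partial rebuild can be sketched in a few lines (an illustrative model only, not DDN's implementation; the class and stripe counts are invented for the example): while part of a pool is offline, the controller journals which stripes were written, and on return only those stripes need rebuilding rather than the whole disk group.

```python
class PartialRebuildJournal:
    """Tracks stripes written while part of a pool is offline.

    Illustrative model: a real controller keeps this state in
    mirrored, battery-backed cache so it survives controller loss
    and power failure, as in the scenario above.
    """

    def __init__(self, total_stripes):
        self.total_stripes = total_stripes
        self.dirty = set()

    def record_write(self, stripe):
        # Writes landing while the enclosure is absent are redirected
        # to the remaining drives and noted here.
        self.dirty.add(stripe)

    def rebuild_work(self):
        # A partial rebuild touches only the dirty stripes; a full
        # rebuild would touch every stripe in the pool.
        return len(self.dirty)


journal = PartialRebuildJournal(total_stripes=1_000_000)
for stripe in (5, 5, 42, 77):       # a few writes during the outage
    journal.record_write(stripe)

full_work = journal.total_stripes   # full rebuild: every stripe
partial_work = journal.rebuild_work()  # partial rebuild: 3 unique dirty stripes
```

This is why an enclosure absent for hours can be resynchronized in minutes: the rebuild cost scales with the write activity during the outage, not with the raw capacity removed.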
SFA Quality of Service
- Read retry timeouts coupled with DirectProtect DIF
- Example: in a RAID 6 (8+2) pool of NL-SAS disks, one member has ~100% higher average latency than the others; production is not impacted thanks to DDN's QoS
Lustre Metadata Benchmark and Performance: How to Scale Lustre Metadata Performance
Lustre Metadata Performance
- Lustre metadata is a crucial performance metric for many Lustre users
  - LU-56 SMP scaling (Lustre 2.3)
  - DNE (Lustre 2.4)
- Metadata performance is closely related to small-file performance on Lustre
- But metadata performance is still a little mysterious:
  - How does performance differ by metadata type and access pattern?
  - What is the impact of hardware resources on metadata performance?
- This presentation: use standard metadata benchmark tools to analyze metadata performance on Lustre today
Lustre Metadata Benchmark Tools
- mds-survey
  - Built into the Lustre code, similar to obdfilter-survey
  - Generates load on the MDS to simulate Lustre metadata performance
- mdtest
  - Major metadata benchmark tool used by many large HPC sites
  - Runs on clients using MPI
  - Supports several metadata operations and access patterns
Single Client Metadata Performance Limitation
Single-client metadata performance does not scale with threads.
[Chart: single-client metadata performance (unique); ops/sec for file creation, file stat and file removal vs. number of threads (1-16); curves stay essentially flat]
From lustre/include/lustre_mdc.h:

    /**
     * Serializes in-flight MDT-modifying RPC requests to preserve idempotency.
     *
     * This mutex is used to implement execute-once semantics on the MDT.
     * The MDT stores the last transaction ID and result for every client in
     * its last_rcvd file. If the client doesn't get a reply, it can safely
     * resend the request and the MDT will reconstruct the reply being aware
     * that the request has already been executed. Without this lock,
     * execution status of concurrent in-flight requests would be
     * overwritten.
     *
     * This design limits the extent to which we can keep a full pipeline of
     * in-flight requests from a single client. This limitation could be
     * overcome by allowing multiple slots per client in the last_rcvd file.
     */
    struct mdc_rpc_lock {
            /** Lock protecting in-flight RPC concurrency. */
            struct mutex rpcl_mutex;
            /** Intent associated with currently executing request. */
            struct lookup_intent *rpcl_it;
            /** Used for MDS/RPC load testing purposes. */
            int rpcl_fakes;
    };

A client can send many metadata requests to the MDS simultaneously, but the MDS must record each client's last transaction ID, and that update is serialized. LU-5319 adds support for multiple slots per client in the last_rcvd file (under development by Intel and Bull).
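The effect of this serialization can be approximated with a simple analytic model (a sketch; the slot counts and per-RPC service time below are illustrative, not measured values): with k reply slots per client and a per-RPC service time t, a single client's modifying-RPC throughput is bounded by min(threads, k) / t.

```python
def modifying_rpc_throughput(threads, slots, service_time_s):
    """Upper bound on one client's MDT-modifying RPC rate (ops/sec).

    Only `slots` modifying requests per client may execute on the MDT
    concurrently (one slot in last_rcvd before LU-5319), so extra
    client threads beyond that add no concurrency.
    """
    concurrency = min(threads, slots)
    return concurrency / service_time_s


# Pre-LU-5319 (single slot): throughput is flat in thread count.
single_slot = [modifying_rpc_throughput(n, slots=1, service_time_s=1e-4)
               for n in (1, 2, 4, 8, 16)]

# With multiple slots (LU-5319): throughput scales with threads.
multi_slot = [modifying_rpc_throughput(n, slots=16, service_time_s=1e-4)
              for n in (1, 2, 4, 8, 16)]
```

This matches the flat curves in the chart above: adding threads on one client cannot help while the last_rcvd slot serializes modifying RPCs.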
Modified mdtest for Lustre
- Basic function
  - Supports multiple mount points on a single client
  - Helps generate a heavy metadata load from a single client
- Background
  - Originally developed by Liang Zhen for the LU-56 work
  - We rebased and cleaned up the code and made a few enhancements
- Enables metadata benchmarks on a small number of clients: regression testing, MDS server sizing, performance optimization
Performance Comparison
A single Lustre client mounts /lustre_0, /lustre_1, ... /lustre_31 for a single filesystem:

    # mdtest -n 10000 -u -d /lustre_{0-15}

[Charts: single-client metadata performance (unique), ops/sec for file creation, file stat and file removal vs. number of threads (1-16), on a 0-60,000 ops/sec scale; single mount point stays flat, while multiple mount points scale with thread count]
Benchmark Configuration
- 2 x MDS (2 x E5-2676v2, 128 GB memory)
- 4 x OSS (2 x E5-2680v2, 128 GB memory)
- 32 x client (2 x E5-2680, 128 GB memory)
- SFA12K-40: 400 x NL-SAS for OSTs, 8 x SSD for MDTs
- Lustre 2.5 for servers; Lustre 2.6.52 and Lustre 1.8.9 for clients
Metadata Benchmark Method
- Tested metadata operations: directory/file creation, directory/file stat, directory/file removal
- Access patterns: unique directory per process and shared directory
  - P0 -> /lustre/dir0/file.0.0, P1 -> /lustre/dir1/file.0.1 (unique)
  - P0 -> /lustre/dir/file.0.0, P1 -> /lustre/dir/file.1.0 (shared)
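The two access patterns can be sketched as path generators (a sketch whose naming simply mirrors the examples above; `rank` is the MPI process number):

```python
def unique_dir_paths(rank, nfiles):
    """Unique pattern: each process works in its own directory
    (mdtest -u), so create/stat/unlink hit different parents."""
    return [f"/lustre/dir{rank}/file.{i}.{rank}" for i in range(nfiles)]


def shared_dir_paths(rank, nfiles):
    """Shared pattern: all processes work in one shared directory,
    so every operation contends on the same parent."""
    return [f"/lustre/dir/file.{rank}.{i}" for i in range(nfiles)]
```

The distinction matters because, in the shared pattern, every create and stat touches the same parent directory on the MDS, which is consistent with the shared-directory results later in this deck trailing the unique-directory results.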
Lustre Metadata Performance: Impact of MDS CPU Speed (Unique Directory)
32 clients (1024 mount points), 1024 processes, 1.28M files; tested on 16 CPU cores at 2.1, 2.5, 2.8, 3.3 and 3.6 GHz
[Charts: relative performance vs. MDS CPU speed for directory operations (creation, stat, removal) and file operations (creation, stat, removal); annotated gains from 2.1 to 3.6 GHz of roughly 20-38% for directory operations and 38-70% for file operations]
Lustre Metadata Performance: Impact of MDS CPU Speed (Shared Directory)
32 clients (1024 mount points), 1024 processes, 1.28M files; tested on 16 CPU cores at 2.1, 2.5, 2.8, 3.3 and 3.6 GHz
[Charts: relative performance vs. MDS CPU speed for directory operations (creation, stat, removal) and file operations (creation, stat, removal); annotated gains of roughly 22-50% for directory operations and 30-58% for file operations]
Lustre Metadata Performance: Impact of MDS CPU Core Count (Unique Directory)
32 clients (1024 mount points), 1024 processes, 1.28M files; tested at 3.3 GHz with 8, 12 and 16 CPU cores, with and without logical processors
[Charts: relative performance vs. core count for directory operations (creation, stat, removal) and file operations (creation, stat, removal); annotated gains of roughly 25% up to 80-120%]
Lustre Metadata Performance: Impact of MDS CPU Core Count (Shared Directory)
32 clients (1024 mount points), 1024 processes, 1.28M files; tested at 3.3 GHz with 8, 12 and 16 CPU cores, with and without logical processors
[Charts: relative performance vs. core count for directory operations (creation, stat, removal) and file operations (creation, stat, removal)]
- Creation (shared) and stat do not scale
- No scaling from 12 to 16 CPU cores
Lustre Metadata Performance: MDS Scalability (Unique Directory)
32 clients, 1024 mount points (multiple mount points)
[Chart: ops/sec (up to ~250,000) vs. number of threads (16-1024) for directory creation/removal and file creation/removal, with and without DNE; annotations: directory creation gains 100% with DNE, and runs at ~50% of the file creation rate]
Lustre Metadata Performance: MDS Scalability (Shared Directory)
32 clients, 1024 mount points (multiple mount points)
[Chart: ops/sec (up to ~100,000) vs. number of threads (16-1024) for directory creation/removal and file creation/removal]
- ~80% of the performance of the unique-directory patterns
Lustre Metadata Performance: File Creation and Removal for Small Files
Creating files with actual file sizes (4K, 8K, 16K, 32K, 64K and 128K), stripe count = 1
[Charts: ops/sec (up to ~180,000) vs. file size (0-131072 bytes) for file creation, file read and file removal; unique and shared directories; 32 clients, 1024 mount points]
- Small-file performance is bounded by metadata performance, but file size has no performance impact
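A quick back-of-the-envelope check of why small-file workloads end up metadata-bound here (the ops/sec figure is read off the chart scale, so treat it as illustrative):

```python
def data_rate_gbs(ops_per_sec, file_size_bytes):
    """Aggregate data rate (GB/s) implied by a small-file create rate."""
    return ops_per_sec * file_size_bytes / 1e9


# Even at ~100,000 creates/s of 128 KiB files, the data stream is only
# ~13 GB/s, well below what the SFA12K back end can deliver, so the
# metadata path, not the data path, is the limiter -- and that is why
# file size has little effect on the curves above.
rate = data_rate_gbs(100_000, 128 * 1024)
```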
Summary of Observations
- MDS server resources significantly affect Lustre metadata performance
  - Performance scales well with CPU core count and CPU speed for unique-directory access, but is not CPU-bound for the shared-directory access pattern
  - Collected baseline results with 16 CPU cores, but more tests with higher core counts are needed
- Performance is highly dependent on the metadata access pattern (e.g. directory creation vs. file creation)
- With actual file sizes (instead of zero-byte files), there is little impact for a small number of OSTs (e.g. up to 40), but testing on larger OST counts is needed
Metadata Performance: Future Work
- Known issues and optimizations
  - Client-side metadata optimization, especially single-client metadata performance
  - Various performance regressions in Lustre 2.5/2.6 that need to be addressed (e.g. LU-5608)
- Areas of future investigation
  - Real-world metadata use scenarios and metadata problems
  - Real-world small-file performance (e.g. life sciences)
  - Impact of OST data structures on real-world metadata performance
  - DNE scalability on very large systems with many MDSs/MDTs and many OSSs/OSTs