SFA12KX and Lustre Update

SFA12KX and Lustre Update
Maria Perez Gutierrez, HPC Specialist
HPC Advisory Council, Sep 2014

Agenda
- SFA12KX features update: partial rebuilds, QoS on reads
- Lustre metadata performance update

SFA12KX Features Update

DDN FY14 Focus Areas: Big Data & Cloud Infrastructure (* GA = future release)
- Big Data Platform Management: DirectMon, analytics, reference architectures
- Core Storage Platforms (Storage Fusion Architecture; flexible drive configuration: SATA, SAS, SSD): SFA12KX/E (FY14) with 40 GB/s, 1.7M IOPS, 1,680 drives supported and embedded computing; SFA7700 with 13 GB/s, 600K IOPS (7700x, 7700E*)
- Petascale Lustre Storage: EXAScaler, 10Ks of clients, 1 TB/s+, HSM, Linux HPC clients, NFS
- Enterprise Scale-Out File Storage: GRIDScaler, ~10K clients, 1 TB/s+, HSM, Linux/Windows HPC clients, NFS & CIFS
- Infinite Memory Engine: distributed file system buffer cache* [demo]; SFX automated flash caching (Read Context Commit, Instant Commit); WolfCreek (FY14)
- Cloud Foundation: S3/Swift cloud tiering; WOS 3.0 with 32 trillion unique objects, geo-replicated cloud storage, 256 million objects/second, self-healing cloud and parallel Boolean search; WOS7000 with 60 drives in 4U self-contained servers
Highlights: Platform (SFA hardening, higher speed, embedded WolfCreek, 7700E*, 7700x*); IME (full speed, PFS acceleration, more use cases under review); WOS (S3/Swift WOS Access, cost reduction, performance improvements)
DDN Confidential - NDA required, roadmap subject to change

SFA12K family addition: SFA12KX, a high-scale active/active block storage appliance (GA: Q2 2014, available May 2014)
Specifications:
- Appliances: active/active block controllers
- CPU: dual-socket Intel 8-core Ivy Bridge
- Memory & battery-backed cache: 128 GB DDR3-1866, 64 GB mirrored
- SFX flash cache: up to 12.8 TB of write-intensive SSDs
- RAID levels: RAID 1/5/6
- IB host connectivity: 16 x 56 Gb/s (FDR)
- FC host connectivity: 32 x 16 Gb/s
- Drive support: 2.5" and 3.5" SAS, SATA, SSD
- Max JBOD expansion: up to 20 supported JBODs (SS7000, SS8460)
Estimated performance (SFA12KX with 20 x SS8460 enclosures, InfiniBand SRP, large-block full-stripe I/O; read I/O includes parity verification): ~48 GB/s read, ~41 GB/s write
DDN Confidential - NDA required, roadmap subject to change

Introducing the SFA12KX (Q2 2014)
[Charts: random and sequential write throughput (MB/s) vs. pool count (1-56) for I/O sizes from 64K to 8M]
- Over 40 GB/s reads AND writes
- Over 20 GB/s MIRRORED writes
- Linear scalability
- SFX ready
- Latest Intel processors
- 32 ports of FC-16

SFA Feature Baseline
- Quality of Service: priority, queuing, real-time QoS for read operations; critical for striped-file performance consistency
- ReACT I/O fairness: lowest latency for small I/O, highest bandwidth for large and streaming I/O, dialed-in host I/O priority during rebuilds
- Storage Fusion Fabric: highly over-provisioned, RAIDed backend fabric that withstands failures of drives, enclosures, cabling, etc.
- Performance and flash: real-time, multi-CPU RAID engine; interrupt-free, massively parallel I/O delivery system; RAID rebuild acceleration; SFX Cache, an automated, data-driven caching system enabling hybrid SSD and HDD systems
- Reliability, availability, monitoring and management: RAID 10 (8+2), fast pool rebuilds, partial rebuilds, adjustable rebuild priority, SSD life counter
- High-density array: up to 840 HDDs in a single rack; 84 HDDs/SSDs in 4U
- Data integrity and security: DirectProtect RAID 1/5/6 with real-time data integrity verification for every I/O; SED management with at-rest encryption of all user data and instant crypto erase
DDN Confidential - NDA required, roadmap subject to change

SFA Partial Rebuild Example: persistent partial rebuild
1. A complete enclosure is removed for a firmware upgrade. The controllers send I/O destined for the removed enclosure to available drives and hold the corresponding metadata in synchronous mirrored cache.
2. Controller 1 fails; I/O continues.
3. A complete system outage occurs due to a power failure; controller 2 writes its cache to onboard drives.
4. Power is restored and the controller cache is restored.
5. The upgraded enclosure undergoes a partial rebuild of the cached data in minutes.
Result: 84 x 4 TB disks (336 TB) removed for hours, rebuilt in minutes.

SFA Quality of Service: read retry timeouts coupled with DirectProtect DIF. Example: in a RAID 6 (8+2) pool of NL-SAS disks, one member has ~100% higher average latency than the others, yet production is not impacted thanks to DDN's QoS.

Lustre Metadata Benchmark and Performance: How to Scale Lustre Metadata Performance

Lustre Metadata Performance
- Lustre metadata is a crucial performance metric for many Lustre users: LU-56 SMP scaling (Lustre 2.3), DNE (Lustre 2.4)
- Metadata performance is closely related to small-file performance on Lustre
- But metadata performance is still a little mysterious: how does performance differ by metadata operation type and access pattern? What is the impact of hardware resources on metadata performance?
- This presentation uses standard metadata benchmark tools to analyze metadata performance on Lustre today

Lustre Metadata Benchmark Tools
- mds-survey: built into the Lustre code, similar to obdfilter-survey; generates load directly on the MDS to simulate Lustre metadata performance
- mdtest: the major metadata benchmark tool used by many large HPC sites; runs on clients using MPI; supports several metadata operations and access patterns (see the sketch below)
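
As a minimal sketch of how mdtest is typically driven (the host file, rank count and iteration count here are assumptions, not taken from the slides; option spellings vary slightly between mdtest versions):

    # Illustrative only: 1024 MPI ranks across the client nodes, each rank
    # creating, stat-ing and removing 10,000 zero-byte files.
    mpirun -np 1024 --hostfile clients.txt \
        mdtest -n 10000 -i 3 -d /lustre/mdtest
    # Adding -u gives every rank its own working directory ("unique" pattern);
    # without it, all ranks operate in a shared directory.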

Single-Client Metadata Performance Limitation
Single-client metadata performance does not scale with the number of threads.
[Chart: single-client metadata performance (unique directory); file creation, stat and removal ops/sec vs. 1-16 threads]

lustre/include/lustre_mdc.h:

    /**
     * Serializes in-flight MDT-modifying RPC requests to preserve idempotency.
     *
     * This mutex is used to implement execute-once semantics on the MDT.
     * The MDT stores the last transaction ID and result for every client in
     * its last_rcvd file. If the client doesn't get a reply, it can safely
     * resend the request and the MDT will reconstruct the reply being aware
     * that the request has already been executed. Without this lock,
     * execution status of concurrent in-flight requests would be
     * overwritten.
     *
     * This design limits the extent to which we can keep a full pipeline of
     * in-flight requests from a single client. This limitation could be
     * overcome by allowing multiple slots per client in the last_rcvd file.
     */
    struct mdc_rpc_lock {
            /** Lock protecting in-flight RPC concurrency. */
            struct mutex rpcl_mutex;
            /** Intent associated with currently executing request. */
            struct lookup_intent *rpcl_it;
            /** Used for MDS/RPC load testing purposes. */
            int rpcl_fakes;
    };

A client can send many metadata requests to the MDS simultaneously, but the MDS must record each client's last transaction ID, and that update is serialized. LU-5319 adds support for multiple slots per client in the last_rcvd file (under development by Intel and Bull).

Modified mdtest for Lustre
Basic functionality:
- Supports multiple mount points on a single client
- Helps generate a heavy metadata load from a single client
Background:
- Originally developed by Liang Zhen for the LU-56 work
- We rebased and cleaned up the code and made a few enhancements
Enables metadata benchmarks on a small number of clients, for:
- Regression testing
- MDS server sizing
- Performance optimization

Performance Comparison
A single Lustre client mounts /lustre_0, /lustre_1, ... /lustre_31 for a single filesystem (see the sketch below):

    # mdtest -n 10000 -u -d /lustre_{0-15}

[Charts: single-client metadata performance (unique); file creation, stat and removal ops/sec vs. 1-16 threads, single mount point vs. multiple mount points]
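
As a rough illustration of the multi-mount setup, a single Lustre filesystem can be mounted at several mount points on one client; the MGS NID (mgsnode@o2ib) and filesystem name (lustre) below are placeholders, not the values used in this benchmark:

    # Illustrative only: mount the same filesystem 32 times so that
    # mdtest ranks on this client spread across independent mount points.
    for i in $(seq 0 31); do
        mkdir -p /lustre_${i}
        mount -t lustre mgsnode@o2ib:/lustre /lustre_${i}
    done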

Benchmark Configuration
- 2 x MDS (2 x E5-2676v2, 128 GB memory)
- 4 x OSS (2 x E5-2680v2, 128 GB memory)
- 32 x client (2 x E5-2680, 128 GB memory)
- SFA12K-40: 400 x NL-SAS for OSTs, 8 x SSD for MDTs
- Lustre 2.5 on the servers; Lustre 2.6.52 and Lustre 1.8.9 on the clients

Metadata Benchmark Method
Tested metadata operations:
- Directory/file creation
- Directory/file stat
- Directory/file removal
Access patterns (unique directory vs. shared directory; see the sketch below):
- Unique: P0 -> /lustre/dir0/file.0.0, P1 -> /lustre/dir1/file.0.1
- Shared: P0 -> /lustre/dir/file.0.0, P1 -> /lustre/dir/file.1.0
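
A hedged sketch of how the two access patterns map onto mdtest options (the rank and file counts are chosen to match the 1024 processes and 1.28M files used later; the exact invocation is not shown on the slides):

    # Unique: each MPI rank works in its own directory.
    mpirun -np 1024 mdtest -n 1250 -u -d /lustre/mdtest_unique
    # Shared: all ranks create, stat and remove entries in one shared directory.
    mpirun -np 1024 mdtest -n 1250 -d /lustre/mdtest_shared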

Lustre Metadata Performance: Impact of MDS CPU Speed (Unique Directory)
Metadata performance comparison, unique directory: 32 clients (1024 mount points), 1024 processes, 1.28M files; tested on 16 CPU cores at MDS CPU speeds of 2.1, 2.5, 2.8, 3.3 and 3.6 GHz.
[Charts: relative performance of directory operations (creation, stat, removal) and file operations (creation, stat, removal) vs. MDS CPU speed; chart annotations note gains of 20%, 38%, 60% and 70%]

Lustre Metadata Performance: Impact of MDS CPU Speed (Shared Directory)
Metadata performance comparison, shared directory: 32 clients (1024 mount points), 1024 processes, 1.28M files; tested on 16 CPU cores at MDS CPU speeds of 2.1, 2.5, 2.8, 3.3 and 3.6 GHz.
[Charts: relative performance of directory operations (creation, stat, removal) and file operations (creation, stat, removal) vs. MDS CPU speed; chart annotations note gains of 22%, 30%, 50% and 58%]

Lustre Metadata Performance: Impact of MDS CPU Cores (Unique Directory)
Metadata performance comparison, unique directory: 32 clients (1024 mount points), 1024 processes, 1.28M files; tested at 3.3 GHz with 8, 12 and 16 CPU cores, with and without logical processors.
[Charts: relative performance of directory operations (creation, stat, removal) and file operations (creation, stat, removal) vs. core count; chart annotations note gains of 25% up to 80-120%]

Lustre Metadata Performance: Impact of MDS CPU Cores (Shared Directory)
Metadata performance comparison, shared directory: 32 clients (1024 mount points), 1024 processes, 1.28M files; tested at 3.3 GHz with 8, 12 and 16 CPU cores, with and without logical processors.
[Charts: relative performance of directory operations (creation, stat, removal) and file operations (creation, stat, removal) vs. core count]
Creation (shared) and stat do not scale; no scaling from 12 to 16 CPU cores.

Lustre Metadata Performance: MDS Scalability (Unique Directory)
Lustre metadata scalability, unique directory: 32 clients, 1024 mount points (multiple mount points per client).
[Chart: ops/sec vs. number of threads (16-1024) for directory creation/removal and file creation/removal, with and without DNE; chart annotations: "100% by DNE" for directory creation, "50% of file creation"]

Lustre Metadata Performance: MDS Scalability (Shared Directory)
Lustre metadata scalability, shared directory: 32 clients, 1024 mount points (multiple mount points per client).
[Chart: ops/sec vs. number of threads (16-1024) for directory creation/removal and file creation/removal]
Roughly 80% of the performance of the unique-directory pattern.

Lustre Metadata Performance: File Creation and Removal for Small Files
Creating files with actual file data (4K, 8K, 16K, 32K, 64K and 128K; stripe count = 1). A sketch of such a run follows below.
[Charts: small-file performance, unique and shared directory, 32 clients with 1024 mount points; file creation, read and removal ops/sec vs. file size from 0 to 128K bytes]
Small-file performance is bounded by metadata performance, but the file size itself has no performance impact.
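
A hedged sketch of how such a run might be driven with mdtest (the -w/-e byte-count options exist in common mdtest versions, but the exact invocation behind these results is not shown):

    # Illustrative only: create, read and remove files that carry real data.
    for size in 4096 8192 16384 32768 65536 131072; do
        mpirun -np 1024 mdtest -n 1250 -F -u -d /lustre/smallfile \
            -w ${size} -e ${size}
    done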

Summary of Observations
- MDS server resources significantly affect Lustre metadata performance. Performance scales well with CPU core count and CPU speed for the unique-directory access pattern, but the shared-directory pattern is not CPU bound. We collected baseline results with 16 CPU cores; more tests across core counts are needed.
- Performance is highly dependent on the metadata access pattern (for example, directory creation vs. file creation).
- With actual file data (instead of zero-byte files), there is little additional impact for a small number of OSTs (e.g. up to 40 OSTs), but testing on a larger number of OSTs is needed.

Metadata Performance: Future Work
Known issues and optimizations:
- Client-side metadata optimization, especially single-client metadata performance
- Various performance regressions in Lustre 2.5/2.6 that need to be addressed (e.g. LU-5608)
Areas of future investigation:
- Real-world metadata use scenarios and metadata problems
- Real-world small-file performance (e.g. life sciences)
- Impact of OST data structures on real-world metadata performance
- DNE scalability on very large systems with many MDSs/MDTs and many OSSs/OSTs