Infinite Memory Engine Freedom from Filesystem Foibles

1 Infinite Memory Engine: Freedom from Filesystem Foibles. James Coomer, 25th September 2017

2 Bad stuff can happen to filesystems: misaligned IO, high concurrency, random access, and shared-file access, all flowing from the compute nodes to the filesystem.

3 And applications want to do bad stuff: multi-physics (complex mixed I/O), workflows and ensembles (globally coordinated I/O), adaptive mesh refinement (varying I/O sizes), checkpoints (shared-file I/O), and machine learning (read intensive, high concurrency, extreme thread counts). Free your applications from the limits of filesystems.

4 How do I make my application go faster? Optimal, natural behaviour already exists within the algorithms, but it is difficult to optimise for multiple aspects simultaneously (GPU, multi-threading, network changes, caches), difficult to optimise the application for new environments, and difficult to maintain efficient porting even between different node counts.

5 How do I make my application go faster?

6 IME Scale-Out Data Platform: IME clients on the compute nodes, a layer of IME servers, and a Lustre or GPFS filesystem behind them (compute nodes → IME servers → filesystem).

7 IO Management: Data Flow to Servers. (1) Clients send IO (network buffers containing fragments and fragment descriptors) to the IME server layer. (2) IME servers write to a log-structured filesystem and register the writes with their peers. (3) Prior to synchronising dirty data to the BFS, IO fragments that belong to the same BFS stripe are coalesced. (4) IO sent to the backing filesystem (BFS) is optimised as aligned, sequential IO.
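To make step (3) concrete, here is a minimal sketch of coalescing, assuming a hypothetical fragment descriptor and using plain pwrite() to stand in for the flush to the backing filesystem; it is an illustration of the idea, not DDN's implementation.

```c
/* Illustrative sketch only -- not DDN's code. Shows the idea of coalescing
 * IO fragments that target the same backing-filesystem region into one
 * aligned, sequential write before the flush to the BFS. */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

typedef struct {            /* hypothetical fragment descriptor */
    off_t  offset;          /* byte offset in the target file   */
    size_t len;             /* fragment length                  */
    char  *data;            /* fragment payload                 */
} fragment_t;

static int by_offset(const void *a, const void *b) {
    const fragment_t *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Coalesce contiguous fragments and issue one pwrite() per merged run. */
void flush_coalesced(int bfs_fd, fragment_t *frags, size_t n) {
    qsort(frags, n, sizeof(*frags), by_offset);
    for (size_t i = 0; i < n; ) {
        off_t  start = frags[i].offset;
        size_t run   = frags[i].len;
        size_t j     = i + 1;
        while (j < n && frags[j].offset == (off_t)(start + run)) {
            run += frags[j].len;           /* extend the sequential run   */
            j++;
        }
        char *buf = malloc(run);           /* build one contiguous buffer */
        for (size_t k = i, pos = 0; k < j; pos += frags[k].len, k++)
            memcpy(buf + pos, frags[k].data, frags[k].len);
        (void)pwrite(bfs_fd, buf, run, start);  /* one aligned, sequential IO */
        free(buf);
        i = j;
    }
}
```

Sorting by offset and merging contiguous runs is what turns many small, possibly misaligned fragments into the aligned, sequential IO that the BFS handles well.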

8 IO Management: Data Flow in the Client, Flexible Erasure Coding. (1) POSIX or MPI-IO I/O is issued by the application. (2) IME places data fragments into buffers and builds corresponding parity buffers. (3) The data and parity buffers are sent to the IME servers. File open, file close and stat metadata requests are passed through to the PFS (Lustre) client. Erasure-coding options: 1+1, 1+2, 1+3, 2+1, 2+2, 2+3, 3+1, 3+2, 3+3, ..., 15+1, 15+2, 15+3.
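As a hedged illustration of step (2), the sketch below builds a single XOR parity buffer for an N+1 scheme such as 3+1; the buffer size and function names are assumptions, and the wider N+M options would require a true erasure code (e.g. Reed-Solomon) rather than plain parity.

```c
/* Minimal sketch of client-side single-parity (N+1) protection. XOR parity
 * recovers any one lost data buffer; this is illustration only, not IME's
 * implementation. */
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 4096          /* assumed network-buffer size */

/* XOR n data buffers into one parity buffer. */
void build_parity(const unsigned char data[][BUF_SIZE], size_t n,
                  unsigned char parity[BUF_SIZE]) {
    memset(parity, 0, BUF_SIZE);
    for (size_t i = 0; i < n; i++)
        for (size_t b = 0; b < BUF_SIZE; b++)
            parity[b] ^= data[i][b];
}

/* Recover a single missing data buffer: XOR the parity with the survivors. */
void rebuild_one(const unsigned char data[][BUF_SIZE], size_t n, size_t lost,
                 const unsigned char parity[BUF_SIZE],
                 unsigned char out[BUF_SIZE]) {
    memcpy(out, parity, BUF_SIZE);
    for (size_t i = 0; i < n; i++)
        if (i != lost)
            for (size_t b = 0; b < BUF_SIZE; b++)
                out[b] ^= data[i][b];
}
```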

9 Distributed Hash Table and Log-Structured Filesystem. Data keys (e.g. File1, File3, File4, File6) are run through a hash function (yielding keys such as DFCD3455, 52ED789E, 46042D43, DC355CE) and distributed across the peers. The DHT provides the foundation for network parallelism, node-level fault tolerance, distributed metadata and self-optimisation for noisy fabrics. At the storage-device level IME uses a log-structured filesystem: new data is added at the log head, space is reclaimed at the log tail, and the log wraps around the device over time. This delivers high device throughput on NAND flash and maximises device lifetime.
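A minimal sketch of the placement idea, assuming an FNV-1a hash and simple modulo placement (both illustrative choices, not IME's actual DHT): hashing a (file id, fragment offset) key spreads fragments and their metadata over all peers with no central lookup.

```c
/* Sketch of DHT-style placement: hash a (file id, fragment offset) key and
 * map it onto one of the IME peers. Illustration only. */
#include <stddef.h>
#include <stdint.h>

static uint64_t fnv1a64(const void *key, size_t len) {
    const unsigned char *p = key;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Return the index of the peer responsible for this fragment. */
unsigned dht_owner(uint64_t file_id, uint64_t frag_offset, unsigned n_peers) {
    uint64_t key[2] = { file_id, frag_offset };
    return (unsigned)(fnv1a64(key, sizeof key) % n_peers);
}
```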

10 IME Data Flow, from compute through fast data (NVM and SSD) to persistent data (disk on SFA), for diverse, high-concurrency applications: the application issues IO to the IME client and erasure coding is applied; the IME client sends fragments to the IME servers; the IME servers write the buffers to NVM and manage internal metadata; the IME servers write aligned, sequential I/O to the SFA backend; and the parallel filesystem operates at maximum efficiency.

11 Amdahl's Law applied to parallel filesystems: with one bad drive on IO node #4, the system will only run as fast as its slowest component.
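As a hedged illustration (the symbols are mine, not the slide's): if a stripe of size S is spread evenly over N targets and target i sustains bandwidth b_i, the stripe completes only when the slowest target finishes, so

```latex
t_{\mathrm{stripe}} = \max_i \frac{S/N}{b_i}
\qquad\Longrightarrow\qquad
B_{\mathrm{eff}} = \frac{S}{t_{\mathrm{stripe}}} = N \cdot \min_i b_i .
```

One degraded drive therefore caps the aggregate at N times the bandwidth of that single slow component, no matter how fast the rest are.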

12 Fabric-Aware, Self-Optimising IO. IME clients observe the load on the IME servers (pending-request queue lengths) and route write requests to avoid highly loaded servers; IO is redirected in real time to avoid bottlenecks.
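A minimal sketch of the routing decision, assuming the client keeps a per-server count of pending requests (the array below is a stand-in for whatever load feedback the real client uses):

```c
/* Sketch of the client-side routing decision: among the candidate IME
 * servers for a write, pick the one with the shortest pending-request
 * queue. Illustration only. */
#include <stddef.h>

unsigned pick_least_loaded(const unsigned pending_reqs[], size_t n_servers) {
    unsigned best = 0;
    for (size_t i = 1; i < n_servers; i++)
        if (pending_reqs[i] < pending_reqs[best])
            best = (unsigned)i;
    return best;
}
```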

13 IO Deblender: IME improves the efficiency of the backing filesystem by turning blended application IO into well-formed IO to the backing filesystem.

14 Flash-Cache Economics: a scale-out flash cache in front of the parallel filesystem. Real-world IO subsystem activity is below 30% of peak 99% of the time and below 5% of peak 70% of the time → build a cache for the peaks and a PFS for the capacity. A flash cache gives a 10x performance-per-watt improvement over a PFS. (Argonne: P. Carns, K. Harms et al., "Understanding and Improving Computational Science Storage Access through Continuous Characterization.")

15 IME predictable performance: eliminate the IO-commit variance of the parallel filesystem. [Histogram of occurrences versus IO time in seconds for IME and Lustre: IME shows a huge reduction in IO variance and a substantial performance gain.] Results taken from large-scale WRF benchmarking.

16 IME Shared-File IO improvements. Parallel filesystems can exhibit extremely poor performance for shared-file IO due to internal lock management: files are managed in large lock units, so several processes writing to the same file contend for locks. IME eliminates the contention by managing IO fragments directly and coalescing IOs prior to flushing to the parallel filesystem, yielding a large performance gain.
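For reference, the kind of N-ranks-to-one-file pattern that triggers this lock contention on a PFS, and that IME absorbs as independent fragments, looks like the following MPI-IO sketch (file name and block size are illustrative):

```c
/* Many ranks writing disjoint regions of one shared file. Compile with an
 * MPI compiler, e.g. mpicc. */
#include <mpi.h>
#include <string.h>

#define BLOCK (1 << 20)                      /* 1 MiB per rank */

int main(int argc, char **argv) {
    int rank;
    static char buf[BLOCK];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, rank & 0xff, BLOCK);

    /* Every rank writes its own disjoint region of one shared file. */
    MPI_File_open(MPI_COMM_WORLD, "shared_checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```

Each rank writes a disjoint 1 MiB region; on a lock-based PFS neighbouring regions can fall into the same lock unit and serialize, whereas IME treats them as unrelated fragments.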

17 Use of Log Structuring in IME. Consider two different application write patterns: sequential and non-sequential. Burst-buffer blocks (BBBs) are really just buffers generated at the client, and their contents may be aligned or not; the same storage method is used for both kinds of block, despite the qualitative difference in their contents.

18 IME Straight-Line Performance. [Charts: IOPS and throughput for random-write IO on a single IME240, for IO sizes from 4k to 4M.] A single IME240 delivers over 1M IOPS and over 20 GB/s in 2 rack units.

19 Thank You! Keep in touch with us: sales@ / @ddn_limitless / company/datadirect-networks. 9351 Deering Avenue, Chatsworth, CA 91311. 1.800.837.2298 / 1.818.700.4000.

20 IME 1.1: Blistering Sequential Performance with a Single Shared File. IME240 sequential performance for a single shared file: over 22 GB/s per server (POSIX IO) and 15 GB/s with 4k IOs (MPI-IO); consistent read and write performance across IO sizes; POSIX IO performance reaches its maximum at roughly 32K IO sizes. [Charts: 4k sequential IO and 1M sequential IO.]

21 IME 1.2: Node Failure Management. IME 1.1 allows continued IO service through IME server failure(s). Following a server failure, IME commences a background rebuild of all missing extents for each Parity Group Instance (PGI): a 3+1 PGI is rebuilt across the 3 remaining servers, so one server is overloaded with 2 blocks per PGI. Between the failure and rebuild completion, read requests can still be satisfied through on-the-fly reconstruction of missing file extents; a second IME server failure in this window would interrupt service (NVM device failures occurring after the rebuild would not). New writes still maintain 3+1 (with overloading). An IME restart is needed to re-introduce the failed node and restore the original redundancy, and this requires a full sync to the BFS. Continued production with 4 IME servers, default parity schema 3+1.

22 IME: HPC IO in the Flash Era. Adaptive IO: IME clients adapt IO rates to the server layer according to load, eliminating traditional IO-system slowdowns. Dial-in resilience: erasure-coding levels are not dictated by the storage-system setup but are set dynamically on a per-file/per-client basis. Lightning rebuilds: fully declustered, distributed data rebuilds allow rebuild rates in excess of 250 GB/minute.

23 IME Shared-File Performance Acceleration. IME240 sequential performance for a single shared file: over 22 GB/s per server (POSIX IO) and 15 GB/s with 4k IOs (MPI-IO); consistent read and write performance across IO sizes; POSIX IO performance reaches its maximum at roughly 32K IO sizes. [Charts: 4k sequential IO and 1M sequential IO, showing the IME performance gain over the parallel filesystem.]

24 NEST (NEural Simulation Tool): dynamics of the interactions between nerve cells; MPI + OpenMP; the I/O pattern is a burst of writes. GPFS scales imperfectly: 200 MB/s for 4 nodes, 400 MB/s for 16 nodes. IME scales almost perfectly: 400 MB/s for 4 nodes, 1500 MB/s for 16 nodes. IME and /dev/null show nearly identical behaviour, except that IME keeps the data.

25 IME: WRF with I/O quilting (one I/O process per node). Using IME, application wall time is reduced by 3.5x and I/O time by 17.2x, moving the run from I/O-bound to compute-bound. Speed-up by measurement: IO wallclock 17.2x; total execution wallclock 3.5x.

26 IME240 Starting Solution: 3+1 erasure coding; 80 GB/s raw performance; erasure-coded performance of 60 GB/s write and 80 GB/s read. Accelerates existing Lustre and GPFS solutions, copes with demanding applications, and relieves cross-contamination between filesystem users.

27 IME for Oil and Gas Seismic. IME expands the cache volume from GBs to PBs and eliminates the cache misses associated with Reverse Time Migration IO patterns. [Diagram: time steps 0, 1, 2, ..., k-2, k-1 are written and then read back; recently written steps are likely to be in cache, while earlier steps have a high chance of a cache miss, on both the compute-node and storage side.]

28 IME with TORTIA (a Reverse Time Migration code): 10x speedup in IO time and 3x speedup in run time. [Chart: normalised run time on Lustre versus the IME burst buffer for Small (80 GB), Medium (950 GB) and Large (8.4 TB) cases.]

29 A 200 GB/s example of IME flash-cache economics. For a 200 GB/s system requirement, an IME solution will be priced lower than an HDD filesystem for capacities below 10 PiB. For an 8 PiB requirement: pure filesystem: 6x GS14KX with 2400 4 TB drives; IME: 10x IME240 plus 1x GS14KX with 1440 8 TB drives, at roughly 20% lower cost.

30 What is IME Today? A new cache layer of NVMe SSDs inserted between the compute cluster and the parallel filesystem (PFS). IME is configured as a cluster of multiple SSD servers, and all compute nodes can access the cached data. It accelerates "bad" IO on the PFS: small and random IO benefit from the high IOPS of the SSDs and from IME's IO management, while the PFS remains good at large sequential IO. IME can be configured as a cache layer with huge IO bandwidth, e.g. over 1 TB/s on JCAHPC's Oakforest-PACS. IME's Active I/O Tier sits right between compute and the parallel filesystem, and the IME software intelligently virtualizes disparate NVMe SSDs into a single pool of shared memory that accelerates I/O, the PFS and applications.

31 DDN IME Data Residency Control. flush_threshold_ratio [0%..100%] is the maximum percentage of dirty data resident in IME before that data is automatically synchronized to the PFS; once synchronized, the data is marked clean. Clean data is kept in IME until min_free_space_ratio [0%..100%] is reached: dirty data is synced down to the flush threshold, and clean data is purged until the minimum free space is restored.
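A minimal sketch of that policy, with placeholder types and callbacks (not the IME API), just to show how the two ratios interact:

```c
/* Sketch of the residency policy described above, using the two ratios
 * named on the slide. The struct fields and sync/purge callbacks are
 * placeholders for illustration. */
typedef struct {
    double dirty_ratio;            /* fraction of IME holding dirty data  */
    double free_ratio;             /* fraction of IME currently free      */
    double flush_threshold_ratio;  /* e.g. 0.80: start syncing to the PFS */
    double min_free_space_ratio;   /* e.g. 0.10: start purging clean data */
} ime_residency_t;

void residency_tick(const ime_residency_t *s,
                    void (*sync_dirty_to_pfs)(void),
                    void (*purge_clean)(void)) {
    /* Too much dirty data resident: write it back to the PFS, after which
     * it is marked clean but stays in IME for reads. */
    if (s->dirty_ratio > s->flush_threshold_ratio)
        sync_dirty_to_pfs();

    /* Not enough free space left: evict clean data only. */
    if (s->free_ratio < s->min_free_space_ratio)
        purge_clean();
}
```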

32 Extreme Rebuild Speeds. Rebuild of a 256 GB SSD (86% full): the rebuild is distributed over all SSDs in the parity group, each SSD rebuilding at ~250 MB/s, so the 256 GB SSD is rebuilt in under 1 minute. [Chart: percentage rebuilt versus time in seconds, from the SSD being offlined to the data being fully rebuilt.]

33 New Ratios for Performance Systems. IME removes the restrictions of HDD-based capacity/performance ratios, makes multiple TB/s manageable, and dramatically reduces component count, power consumption, space consumption and capital cost (see the Astar model). [Charts: number of drives and number of nodes required for 1 TB/s, comparing Blue Waters (all), NERSC Cori Phase 1 FS, NERSC Cori Phase 1 BB, ORNL (Spider 2), NSCC (burst buffer) and JCAHPC (burst buffer).]

34 DDN Data Consistency Model: a lock-free model. Dirty or otherwise unsynchronized data within an IME client's cache is not visible to or retrievable by other clients; synchronizing on close() ensures that IME clients provide close-to-open consistency.
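From the application's point of view, close-to-open consistency means the familiar POSIX pattern below is sufficient (paths and roles are illustrative):

```c
/* Close-to-open consistency as seen from an application: data written by
 * client A is only guaranteed visible to client B after A has closed the
 * file and B subsequently opens it. Path is illustrative. */
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

void producer(void) {                       /* runs on client A */
    int fd = open("/ime/results.dat", O_CREAT | O_WRONLY, 0644);
    const char msg[] = "step complete\n";
    (void)write(fd, msg, strlen(msg));
    close(fd);        /* synchronizes: data now visible to other clients */
}

void consumer(void) {                       /* runs on client B, after A */
    char buf[64];
    int fd = open("/ime/results.dat", O_RDONLY);  /* open after A's close */
    ssize_t n = read(fd, buf, sizeof buf);
    (void)n;
    close(fd);
}
```

Client B is guaranteed to see client A's data because A closed the file before B opened it; no file locks are required.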

35 WRF on IME: 48 jobs across 240 compute nodes (48 concurrent MPI jobs, 5 nodes per job, 20 MPI ranks per node), with IME erasure coding 7+1. [Diagram: WRF output and checkpoint traffic from the 240 compute nodes flows over EDR to IME and then to Lustre; figures shown include 500 GB/s, 150 GB/s and ~390 GB/s of IME capacity consumption.]

36 WRF at Scale: Summary Results. Application throughput was 380 GB/s on IME versus 100 GB/s on the parallel filesystem, a 3.8x improvement; per-unit efficiency compares as follows.

Metric                  IME (count)  IME GB/s per unit  PFS (count)  PFS GB/s per unit  IME improvement
Rack units              36           10.5               224          0.45               23x
IO nodes                18           21                 42           2.4                8.7x
Drives                  432          0.9                2800         0.04               22x
Power consumption (kW)  27           14                 70           1.4                10x

37 IME Use Cases Jobs with Dependencies (workflows) Application ensembles: Multiple, simultaneous applications use cache to support communication throughout the job. Application workflows: cache stores a common dataset used by a succession of independent applications. In-situ analysis: real-time output of an application s data in cache analysed in-situ Application pre-processing: pre-processing job places output in cache, ready for main run. Application post-processing: main run outputs to cache, post-processing commences in situ. Visualization: users may want to keep the data available for use by shared compute systems. COMPUTE input JOB1 JOB2 JOB3 JOB5 JOB4 tmp.1 tmp.2 tmp.3 tmp.4 IME final Benefit Don t use PFS for scratch files Fast IO for dependent jobs e.g. Weather visualisation environments input BFS final