IME Infinite Memory Engine Technical Overview


1 IME Infinite Memory Engine Technical Overview

2 Bandwidth and IOPS of a single NVMe drive [chart]

3 What does flash mean for storage? It's a new fundamental device for storing bits, and we must treat it differently from HDD. We have to manage data placement across tiers at a larger scale. It also brings new opportunities for novel developments around scaling, data security, and performance protection.

4 DDN IME Application I/O Workflow
Compute tier: diverse, high-concurrency applications. Fast data tier: NVM & SSD. Persistent data tier: disk.
1. The lightweight IME client intercepts application I/O and places fragments into buffers, plus parity.
2. The IME client sends fragments to the IME servers.
3. The IME servers write buffers to NVM and manage internal metadata.
4. The IME servers write aligned, sequential I/O to the SFA backend.
5. The parallel file system operates at maximum efficiency.

5 Distributed Hash Table + Log-Structured Filesystem
Distributed hash table (DHT): a hash function maps data keys to peers in the distributed network (e.g., File1 -> DFCD3455, File4 -> 52ED789E, File3 -> 46042D43, File6 -> DC355CE). The DHT provides the foundation for network parallelism, node-level fault tolerance, and distributed metadata.
Log-structured filesystem, used at the storage-device level: new data is added at the log head, space is reclaimed at the log tail, and the log wraps around over time. This gives high-performance device throughput on NAND flash while maintaining device longevity.
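
To make the placement concrete, here is a minimal sketch of DHT-style placement in Python. The hash function, peer names, and key format are illustrative assumptions, not DDN's actual scheme; the point is that every client independently computes the same owner for a key, so no central metadata server is consulted.

    import hashlib

    def place(key, peers):
        # A stable hash of the data key selects the responsible peer;
        # every client computes the same answer with no coordination.
        digest = hashlib.sha1(key.encode()).hexdigest()
        return peers[int(digest, 16) % len(peers)]

    peers = ["ime00", "ime01", "ime02", "ime03"]
    print(place("File1:offset0", peers))  # hypothetical key format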

6 DDN IME Data Flow in the Client (compute node; POSIX or MPI-IO application over IME and Lustre)
1. I/O is issued by the application.
2. IME places fragments into data buffers (the accumulator).
3. Parity buffers are built simultaneously.
4. Metadata requests (file open, file close, stat) are passed through to the PFS client.
5. Once full, buffers are sent to the IME server layer.
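
A minimal sketch of this accumulator, under stated assumptions: fixed-size buffers, one XOR parity buffer per stripe, and a hypothetical send_to_ime_servers() standing in for the network path. DDN's real client is far more involved; this only shows steps 2, 3, and 5 in miniature.

    def send_to_ime_servers(data_bufs, parity):
        pass  # hypothetical stand-in for the RPC to the IME server layer

    class Accumulator:
        """Buffers write fragments and builds parity alongside them."""
        def __init__(self, bufs_per_stripe=3, buf_bytes=1 << 20):
            self.bufs_per_stripe = bufs_per_stripe
            self.buf_bytes = buf_bytes
            self.data_bufs = []
            self.parity = bytearray(buf_bytes)

        def write_fragment(self, fragment):
            buf = fragment.ljust(self.buf_bytes, b"\0")  # step 2: buffer the fragment
            self.data_bufs.append(buf)
            for i, b in enumerate(buf):                  # step 3: parity built as we go
                self.parity[i] ^= b
            if len(self.data_bufs) == self.bufs_per_stripe:
                self.flush()

        def flush(self):                                 # step 5: ship the full stripe
            send_to_ime_servers(self.data_bufs, bytes(self.parity))
            self.data_bufs, self.parity = [], bytearray(self.buf_bytes)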

7 DDN IME Erasure Coding
Data protection against IME server or SSD failure is optional (the lost data is "just cache").
Erasure coding is calculated at the client: it scales well to extremely high client counts, and the servers don't get clogged up.
IME erasure coding does reduce usable client bandwidth and usable IME capacity:
3+1: 56 Gb/s -> 42 Gb/s
5+1: 56 Gb/s -> 47 Gb/s
7+1: 56 Gb/s -> 49 Gb/s
8+1: 56 Gb/s -> 50 Gb/s
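
The reductions follow directly from the stripe geometry: in a k+1 scheme, only k of every k+1 buffers sent over the link carry data, so usable bandwidth is k/(k+1) of the raw rate. A quick check against the slide's figures, assuming 56 Gb/s is the raw per-client link rate:

    # In a k+1 erasure-coding scheme, k of every k+1 transmitted buffers
    # carry data, so usable bandwidth scales by k / (k + 1).
    raw_gbps = 56
    for k in (3, 5, 7, 8):
        print(f"{k}+1: {raw_gbps} Gb/s -> {raw_gbps * k / (k + 1):.0f} Gb/s usable")
    # 3+1: 42, 5+1: 47, 7+1: 49, 8+1: 50 -- matching the slide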

8 DDN IME Data Residency Control
flush_threshold_ratio [0%..100%]: the maximum percentage of dirty data resident in IME before the data is automatically synchronized to the PFS. Once synchronized, the data is marked clean.
min_free_space_ratio [0%..100%]: clean data is kept in IME until this free-space floor is reached, then purged.
[diagram: IME holds dirty and clean data; dirty data is synced to the PFS until flush_threshold_ratio is satisfied; clean data is purged until min_free_space_ratio is satisfied]
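
A minimal sketch of how the two knobs interact. The parameter names come from the slide; the values, data structures, and oldest-first eviction order are illustrative assumptions:

    flush_threshold_ratio = 0.50  # sync dirty data to the PFS above this fraction
    min_free_space_ratio = 0.10   # purge clean data below this much free space

    def housekeeping(dirty, clean, capacity):
        """dirty/clean are lists of buffer sizes, oldest first."""
        while sum(dirty) / capacity > flush_threshold_ratio:
            buf = dirty.pop(0)    # sync_to_pfs(buf) would run here
            clean.append(buf)     # once synchronized, the data is marked clean
        while clean and (capacity - sum(dirty) - sum(clean)) / capacity < min_free_space_ratio:
            clean.pop(0)          # clean data can be dropped: the PFS holds a copy
        return dirty, clean

    print(housekeeping(dirty=[30, 30, 10], clean=[20], capacity=100))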

9 Use of Log Structuring in IME
Consider two different application write patterns: sequential and non-sequential.
Burst buffer blocks (BBBs) are really just buffers generated at the client. Note that the contents of a BBB can be aligned or not; the same storage method is used for both kinds of block, despite the qualitative difference of their contents!
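
A minimal sketch of why the device doesn't care about the application's pattern (illustrative, not DDN code): every fragment is appended to the log with its file offset recorded alongside, so the device sees purely sequential writes either way.

    log = []  # the device-level log: strictly append-only

    def bbb_write(file_offset, data):
        # The file offset travels with the data as metadata; the write
        # itself always lands at the log head, i.e. sequentially.
        log.append((file_offset, data))

    for off in (0, 4096, 8192):           # sequential application pattern
        bbb_write(off, b"x" * 4096)
    for off in (0, 1 << 20, 4096):        # strided/non-sequential pattern
        bbb_write(off, b"y" * 4096)       # identical path, identical cost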

10 Use of Log Structuring in IME
What does this give us? Near line-rate performance regardless of output pattern.
[chart: 3 IOR checkpoints to IME at ~50 GB/s (4k strided, shared file)]

11 Non-Deterministic Data Placement
Deterministic approach: IME clients use a hash function to place each fragment on a fixed host.
Non-deterministic approach: IME clients learn and observe the load of IME servers (via pending-request queue lengths) and route write requests to avoid highly loaded servers.
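
A minimal sketch contrasting the two policies; the queue-length bookkeeping and server names are illustrative assumptions:

    import hashlib

    def deterministic(key, servers):
        # Hash-based: the target is fixed by the key, however busy it is.
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return servers[h % len(servers)]

    def adaptive(servers, pending):
        # Load-aware: route to the server with the shortest observed
        # pending-request queue, steering around degraded nodes.
        return min(servers, key=lambda s: pending[s])

    servers = ["ime00", "ime01", "ime02"]
    print(deterministic("File1:0", servers))
    print(adaptive(servers, {"ime00": 12, "ime01": 3, "ime02": 97}))  # -> ime01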

12 Aggregate IME Adaptive vs. Non-Adaptive WRITE Performance
[chart series: ideal, healthy system; one degraded IME server, adaptive; one degraded IME server, non-adaptive (Amdahl's Law in action!)]

13 IME v1.0 Mount Points
The FUSE client provides the IME POSIX mount point:
    # df -h /ime/gsfs
    Filesystem  Size  Used  Avail  Use%  Mounted on
    imefs        26T  3.9T    22T   16%  /ime/gsfs
Filesystem mount point:
    # df -h /dev/gsfs/
    Filesystem  Size  Used  Avail  Use%  Mounted on
    /dev/gsfs    26T  3.9T    22T   16%  /gsfs


19 Parallel File System: Shared File Performance
[chart; annotation: filesystem locking]

20 IME vs Parallel File System: Shared File Performance
[chart]

21 Rack Performance: IME
[charts: IOR file-per-process bandwidth (GB/s) and 4k random IOPS, write and read]
~550 GB/s read and write; ~50 million IOPS.

22 Benchmark Data: POSIX Single Shared File, IOR with Segments
[charts: IME, 20 nodes; write and read bandwidth (GB/s) and IOPS versus transfer size, 4k to 1024k]

23 IME Burst Buffer: productizing many years of research and development.
Yields a huge percentage of peak bandwidth: no server-side erasure-coding overhead; fewer memory copies / visits.
Flexible data placement: allows writes to avoid slow or oversubscribed components.
Log-structured writes: utilize NVM devices in the most performant manner.
Completely declustered RAID: rebuilds of devices in MINUTES, not days.

24 Thank You! Keep in touch with us.
sales@
9351 Deering Avenue, Chatsworth, CA 91311
@ddn_limitless
1.800.837.2298 | 1.818.700.4000
company/datadirect-networks