Accelerating Spectrum Scale with an Intelligent IO Manager


Accelerating Spectrum Scale with an Intelligent IO Manager
Ray Coetzee, Pre-Sales Architect, Seagate Systems Group, HPC
2017 Seagate, Inc. All Rights Reserved.

ClusterStor: Lustre, Spectrum Scale and Object
Vertically integrated: from the raw media to the fastest systems in the world.
- L300/N Lustre System: up to 360 GB/s per rack, Lustre 2.5 / 2.7
- Secure Lustre: up to 60 GB/s per rack, Lustre 2.5 on SE-Linux
- G200, G300/N with ISS: up to 360 GB/s per rack, IBM Spectrum Scale 4.2
- A200 Object Store: tiered archive, more than 5 PB per rack
Enclosures: CP-3584 (up to 84 x 8 TB drives, dual controllers), SP-3224 / SP-3424 (24 x 2.5" drives or SSDs, dual controllers), SP-3224 (24 x 8 TB drives, dual controllers)
Media: SAS and SSD flash accelerators, SATA/NL-SAS 10 TB 7.2K RPM, HPC drive 4 TB 10K RPM, SAS SSD 3.2 TB, NVMe 1.3 TB, PCIe x16 NVMe 10 GB/s, SMR drive 8 TB
* File system performance (GB/s) per [HDD, RU, enclosure, rack]

HPC I/O Storage Stack Is Transitioning
The traditional stack: Memory -> Parallel File System -> Tape Archive

HPC I/O Storage Stack Is Transitioning
Adoption of flash and object storage expands the I/O stack:
- Memory (DRAM): 1 PB/sec; residence hours; overwritten continuously; 100,000,000s of OPS/sec
- Burst Buffer (flash tier): 1s of TB/sec; residence hours; overwritten in hours; 10,000,000s of OPS/sec
- Parallel File System: 100s of GB/sec; residence days/weeks; flushed in weeks; 1,000,000s of OPS/sec
- Campaign Storage: 10 GB/sec; residence months to a year; flushed in months to a year; 100,000s of OPS/sec
- Cloud Archive / Tape Archive: 1s of GB/sec (parallel tape); residence forever; 10,000s of OPS/sec

Flash Tiers Come in Many Shapes
Flash acceleration options and examples:
- Server side (memory-like AIC / SSDs): 3DXpoint, NVMe / Nytro / DataWarp / LROC
- Network attached (tiered flash, file system): IME, all-flash array
- In the file system: AFA, SSD pools in the file system, HAWC
- Enhanced storage (flash-accelerated HDD tier): Seagate NytroXD, SSD-based read cache
All alternatives have advantages and drawbacks; most solutions have significant cost implications; actual value depends heavily on application capabilities. Bottom line: the jury is still out.

Storage-Side Flash Acceleration, Seagate Style

What's Possible With Just a Little Invisible Flash: Pre-Alpha Results

Basic Nytro IO Manager Data Path
1. Incoming IOs are profiled and filtered.
2. Large IOs go to spindle; small IOs go to SSD (see the sketch below).
3. Writes are acknowledged once the data is on media.
4. Small IOs are coalesced and flushed to spindle.
[Diagram: Spectrum Scale NSD block layer -> Nytro filter driver -> Nytro caching lib (in memory); writes acknowledged from either the SSDs or the spinning media]
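The routing decision in steps 1 and 2 is, at its core, a size threshold: requests above the cut-off go straight to the HDD-based NSDs, while smaller requests are acknowledged from the SSD cache and coalesced before being flushed. Below is a minimal sketch of that logic, assuming a made-up 32 KiB threshold and illustrative names (route_io, io_req); it is not the actual Nytro filter-driver code.

```c
#include <stdio.h>
#include <stddef.h>

/* Illustrative cut-off only: the real threshold is a property of the
 * Nytro filter driver, not something stated on the slide. */
#define SMALL_IO_THRESHOLD (32 * 1024)

enum io_target { TARGET_SSD_CACHE, TARGET_SPINNING_MEDIA };

struct io_req {
    size_t length;              /* size of the request in bytes */
};

/* Steps 1-2 of the data path: profile the incoming IO by size and pick
 * where it should land. Small IOs are acknowledged once they are on SSD
 * and coalesced into larger flushes to the HDD tier later (steps 3-4). */
static enum io_target route_io(const struct io_req *req)
{
    if (req->length >= SMALL_IO_THRESHOLD)
        return TARGET_SPINNING_MEDIA;   /* large, streaming-friendly IO */
    return TARGET_SSD_CACHE;            /* small, random IO */
}

int main(void)
{
    struct io_req small = { .length = 4 * 1024 };
    struct io_req large = { .length = 1024 * 1024 };

    printf("4 KiB write -> %s\n",
           route_io(&small) == TARGET_SSD_CACHE ? "SSD cache" : "spinning media");
    printf("1 MiB write -> %s\n",
           route_io(&large) == TARGET_SSD_CACHE ? "SSD cache" : "spinning media");
    return 0;
}
```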

The Test Bench
- Spectrum Scale 4.2.3 throughout
- 1 GiB pagepool, prefetchAggressiveness=0
- 16x dual-socket E5-2630 v3 @ 2.40 GHz clients with 64 GB (8x 8 GB) DDR4
- Dataset sizes of 1x, 2x, 3x, 4x PP at 3 iterations unless noted
- 2x GridRAID arrays, 40 NL-SAS disks each
- All G300N disk drives have write cache disabled
- GridRAID/HDD-based NSDs use a 32 MB stripe cache
- 4x 1.6 TB SSDs in RAID 10, partitioned 50% to the system pool
- SSDs used: random read 4 KiB QD32 = 200,000 IOPS at <= 5 us; random write 4 KiB QD32 = 80,000 IOPS at <= 12.5 us; DWPD = 10

How Many IOPS?
First, define it and agree how to measure it: ask many people and you get many different answers.
Second, what is actually needed? So many layers obscure the answer:
- What does the file system client see from the application? Direct IO vs. system-call buffered IO (write) vs. library-buffered IO (fwrite), contrasted in the example below
- What do the RAID device and the individual disks see?
- Mixed, simultaneous use cases
- Advanced read-ahead mechanics
- Direct IO with HAWC
These considerations ALL also apply when designing an IOPS test, and the answer changes with every new researcher, product, application, and application version.
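To make the "so many layers" point concrete, the same 4 KiB of application data can reach the file system client through three quite different paths. A minimal Linux sketch contrasting library-buffered IO (fwrite), system-call buffered IO (write) and Direct IO (O_DIRECT); the file names are placeholders and error handling is omitted:

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define XFER (4 * 1024)         /* 4 KiB, as in the tests above */

int main(void)
{
    char buf[XFER];
    memset(buf, 'x', sizeof(buf));

    /* 1. Library-buffered IO: fwrite() copies into the stdio buffer; the
     *    system only sees the data when that buffer is flushed (fclose). */
    FILE *fp = fopen("buffered_lib.dat", "w");
    fwrite(buf, 1, XFER, fp);
    fclose(fp);

    /* 2. System-call buffered IO: write() returns once the data is cached
     *    by the kernel / file system client; when and how it reaches the
     *    storage is decided later. */
    int fd = open("buffered_sys.dat", O_CREAT | O_WRONLY, 0644);
    write(fd, buf, XFER);
    close(fd);

    /* 3. Direct IO: O_DIRECT bypasses the caches, so each call becomes one
     *    aligned 4 KiB request as far as the file system is concerned. */
    void *dbuf = NULL;
    if (posix_memalign(&dbuf, 4096, XFER) == 0) {
        memset(dbuf, 'x', XFER);
        int dfd = open("direct.dat", O_CREAT | O_WRONLY | O_DIRECT, 0644);
        pwrite(dfd, dbuf, XFER, 0);
        close(dfd);
        free(dbuf);
    }
    return 0;
}
```

Only in the third case does the file system see exactly one 4 KiB request per call; in the first two, the stdio buffer and the cache below it decide what the underlying layers actually observe.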

16N x 16PPN, 4 KiB, No NXD
[Results chart]

8N x 16PPN, Direct IO, Random 4 KiB Transfers, 128 GB/Node (2x client memory!) with NXD

IOR summary:
  api                = POSIX
  test filename      = /mnt/copperfs//scratch/1493933271.0/out
  access             = file-per-process
  pattern            = strided (2097152 segments)
  ordering in a file = random offsets
  ordering inter file= no tasks offsets
  clients            = 128 (16 per node)
  repetitions        = 1
  xfersize           = 4096 bytes
  blocksize          = 4096 bytes
  aggregate filesize = 1024 GiB

Write phase: began Thu May 4 14:28:05 2017, finished Thu May 4 15:22:04 2017; Max Write: 323.74 MiB/sec (339.47 MB/sec)
Read phase:  began Thu May 4 15:22:17 2017, finished Thu May 4 15:34:29 2017; Max Read: 1433.20 MiB/sec (1502.81 MB/sec)

Cache statistics:
Counter                     | Start      | After Write  | Delta After W | After Read   | Delta After R
Cache size in use           | 0          | 549755813888 | 549755813888  | 549755813888 | 0
Total number of IOs         | 2645424241 | 2784385958   | 138961717     | 2914353110   | 129967152
Number of reads             | 1507463413 | 1511715076   | 4251663       | 1641682228   | 129967152
Number of writes            | 1137960828 | 1272670882   | 134710054     | 1272670882   | 0
Total number of bypass IOs  | 655710496  | 656236614    | 526118        | 656236677    | 63
Number of bypass reads      | 235763879  | 419947641    | 184183762     | 419947704    | 63
Number of bypass writes     | 235763879  | 236288973    | 525094        | 236288973    | 0
Number of cache hits        | 1716621799 | 1720871402   | 4249603       | 1850806759   | 129935357 (99.98%)
Number of cache misses      | 273091946  | 407277942    | 134185996     | 407309674    | 31732
Number of dirty CWs         | 0          | 0            | 0             | 0            | 0
Total cache blocks flushed  | 1017817720 | 1152002680   | 134184960     | 1152002680   | 0

mmperfmon gpfsnumberoperations

Sequential IO, Random IO, Simultaneous, No NXD
No NXD (Test 2): Fill 8x16, 8 GiB/N
- Sequential write 8x1 1 MiB: 9590.25 MB/sec (14:15:09 - 14:17:04)
- Sequential read 8x1 1 MiB: 9169.70 MB/sec (14:17:17 - 14:19:17)
- Concurrent random 8x16: write 0.85 MB/sec (14:15:10 - 14:16:02), read 12.10 MB/sec (14:16:12 - 14:17:02)
- Concurrent random 8x16: write 0.95 MB/sec (14:17:18 - 14:18:10), read 14.02 MB/sec (14:18:20 - 14:19:11)
mmperfmon gpfsnumberoperations

Sequential IO, Random IO, Simultaneous, No NXD (cont.)
Same No NXD (Test 2) run as the previous slide, this time charted with mmperfmon gpfs_nsdfs_write_ops,gpfs_nsdfs_read_ops; the fill, sequential (9590.25 / 9169.70 MB/sec) and random (0.85 / 12.10 and 0.95 / 14.02 MB/sec) figures are unchanged.

Sequential IO, Random IO, Simultaneous, NXD
NXD Test 5: Fill 8x16, 8 GiB/N
- Sequential write 8x1 1 MiB: 8611.76 MB/sec (08:09:53 - 08:12:00)
- Sequential read 8x1 1 MiB: 7877.00 MB/sec (08:12:14 - 08:14:34)
- Concurrent random 8x16: write 90,375 ops/s (08:09:54 - 08:10:57), read 315,439 ops/s (08:11:08 - 08:12:02)
- Concurrent random 8x16: write 103,767 ops/s (08:12:15 - 08:13:35), read 332,587 ops/s (08:13:48 - 08:14:38)
mmperfmon gpfsnumberoperations

Sequential IO, Random IO, Simultaneous, NXD (cont.)
Same NXD Test 5 run as the previous slide, this time charted with mmperfmon gpfs_nsdfs_write_ops,gpfs_nsdfs_read_ops; the fill, sequential (8611.76 / 7877.00 MB/sec) and random (90,375 / 315,439 and 103,767 / 332,587 ops/s) figures are unchanged.

Benefits of Nytro
The benefits of this method:
1. The penalties of read-modify-writes are greatly reduced (illustrated in the sketch below).
2. It reduces the need for ILM to move data between pools.
3. IO is served by the most appropriate media type.
4. It operates on much larger IO sizes than HAWC (<= 1 MB).
5. It can address larger SSD pools than LROC (~76 TB).
6. It can be enabled, disabled and tuned without changing the file system.
7. It mitigates the financial impact of going all-flash.
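To see why the first point matters, consider a hypothetical 8+2 parity stripe (chosen only for illustration; it is not necessarily the GridRAID geometry used here). A sub-stripe write forces a read-modify-write of the old data and parity, whereas small writes absorbed on SSD can later be flushed as full-stripe writes that need no reads at all:

```c
#include <stdio.h>

/* Hypothetical parity layout used only for illustration:
 * a RAID-6-style 8+2 stripe, i.e. 8 data strips + 2 parity strips. */
#define DATA_STRIPS   8
#define PARITY_STRIPS 2

int main(void)
{
    /* Sub-stripe update (read-modify-write): read the old data strip and
     * the old parity strips, then write the new data and new parity. */
    int rmw_ios = (1 + PARITY_STRIPS)       /* reads  */
                + (1 + PARITY_STRIPS);      /* writes */

    /* Full-stripe write after coalescing on SSD: parity is computed from
     * the new data alone, so no reads are needed. */
    int full_stripe_ios = DATA_STRIPS + PARITY_STRIPS;   /* writes only */

    printf("one small write via RMW        : %d disk IOs\n", rmw_ios);
    printf("one coalesced full-stripe write: %d disk IOs for %d strips of data\n",
           full_stripe_ios, DATA_STRIPS);
    return 0;
}
```

Six back-end IOs to land a single small write, versus ten IOs that carry eight full strips of new data, is the gap the SSD staging is meant to close.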

Monitoring: Lessons Learnt
Is it actually random/small/fast? The available tools each have limits:
- Nytro histogram (profiles IO size only; see the bucketing sketch below)
- mmpmon (not granular enough)
- mmperfmon (good, but can't confirm the IO type)
- --iohist on the NSD server (prone to dropping metrics)

Example device statistics:
  Device [md300]: zero sector seeks: 0; non-zero sector seeks: 64976; percent zero sector seeks: 0.00
  Device [md300] == [data] == [W]: total operations: 64976; avg duration (ms): 127.41; avg size: 8.00; percent W: 100.00; rate: 0.03 MiB/s
  Num Sectors Histogram: count: 64976; range: 8.000-16.000; mean: 8.000; median: 8.000; stddev: 0.044; percentiles: 90th: 8.000, 95th: 8.000, 99th: 8.000; 8.000-8.591: 64974; 8.591-16.000: 2

Nytro histogram:
  Reads:  < 4K = 0; 4K = 161393; 4K+1-8K = 1; 8K+1-16K through 256K+1-512K = 0; 512K+1-1M = 604492; 1M+1-2M through 16M+1-32M = 0
  Writes: < 4K = 0; 4K = 11309; 4K+1-8K through 256K+1-512K = 0; 512K+1-1M = 525490; 1M+1-2M through 16M+1-32M = 0
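The Nytro histogram shown above is, in effect, one counter per IO-size bin that is incremented as requests are profiled. A minimal sketch of that bucketing, with the bin edges copied from the output above and a made-up sample of IO sizes standing in for the observed stream:

```c
#include <stdio.h>

/* Size bins copied from the Nytro histogram output above:
 * "< 4K", exactly "4K", then power-of-two ranges up to "16M+1-32M". */
#define NBINS 15

static const char *bin_name[NBINS] = {
    "< 4K", "4K", "4K+1-8K", "8K+1-16K", "16K+1-32K", "32K+1-64K",
    "64K+1-128K", "128K+1-256K", "256K+1-512K", "512K+1-1M",
    "1M+1-2M", "2M+1-4M", "4M+1-8M", "8M+1-16M", "16M+1-32M"
};

/* Map an IO size in bytes to a bin index; anything above 32 MiB is
 * simply counted in the last bin in this sketch. */
static int bin_index(unsigned long bytes)
{
    if (bytes < 4096)  return 0;
    if (bytes == 4096) return 1;
    unsigned long upper = 8192;     /* upper edge of bin 2 ("4K+1-8K") */
    int idx = 2;
    while (idx < NBINS - 1 && bytes > upper) {
        upper <<= 1;
        idx++;
    }
    return idx;
}

int main(void)
{
    /* Hypothetical stream of IO sizes, standing in for what the filter
     * driver would observe. */
    unsigned long sizes[] = { 4096, 4096, 528384, 1048576, 8192 };
    unsigned long hist[NBINS] = { 0 };

    for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
        hist[bin_index(sizes[i])]++;

    for (int b = 0; b < NBINS; b++)
        if (hist[b])
            printf("Num IOs %-12s = %lu\n", bin_name[b], hist[b]);
    return 0;
}
```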